xgastaldi/shake-shake

Why is the test top1 different?

Closed this issue · 4 comments

I changed shakeshakeblock.lua, then ran the code for 400 epochs.
The picture below shows the log output during training.
(training log screenshot)

After 400 epochs of training, I ran CUDA_VISIBLE_DEVICES=0,1,2,3 th main.lua -dataset cifar10 -nGPU 4 -testOnly true -retrain ./checkpoints/model_best.t7
The result was Results top1: 3.670 top5: 0.020.

Question 1: Why was the test top1 from "testOnly" lower than the test top1 seen during training?
Question 2: What is the difference between the best test top1, the last epoch's test top1, and the "testOnly" top1?
Question 3: Since the top1 of 3.67 was produced by the network's model, can I say my model achieves a top1 error of 3.67?

Question 1: It's probably because you have to set -shareGradInput to false and -optnet to false as well when you use -testOnly true to test a saved model.
Question 2: You have to use the last test top1 because there is no validation set on CIFAR-10. When you do have a validation set, you usually find the epoch that gives the best validation error (let's say epoch 1715) and report the test error associated with that epoch, even if that test error is not the best one. You do that to estimate the error rate you would get if you tested your best model on unseen real-life data. You can't do that without a validation set, and if you report the best test error you will probably overestimate the effectiveness of your system. That's why you have to pick an epoch once your model has converged and report the error rate at that epoch; the last epoch is just a convention. You could also use the average of the last 10 epochs, but the difference is usually minimal.
Question 3: See Question 1. You should get the same error rate as the first run.
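The reporting conventions above can be sketched in a few lines of Python. This is only an illustration: the per-epoch error lists below are made-up numbers, not real training results.

```python
# Hypothetical per-epoch test top1 errors (invented for illustration).
test_top1 = [5.10, 4.20, 3.90, 3.74, 3.80, 3.76, 3.74, 3.78, 3.75, 3.74]

# Convention 1 (requires a validation set): pick the epoch with the best
# validation error and report the test error at that same epoch.
val_top1 = [5.30, 4.10, 3.95, 3.80, 3.85, 3.79, 3.77, 3.81, 3.78, 3.76]
best_val_epoch = min(range(len(val_top1)), key=lambda e: val_top1[e])
report_with_val = test_top1[best_val_epoch]

# Convention 2 (no validation set, as on CIFAR-10 here): report the
# test error at the last epoch.
report_last = test_top1[-1]

# Convention 3: average the test error over the last 10 epochs.
report_avg10 = sum(test_top1[-10:]) / len(test_top1[-10:])

# Note: reporting min(test_top1) directly would overestimate the
# system, which is exactly the pitfall described above.
print(report_with_val, report_last, round(report_avg10, 3))
```

Note that conventions 2 and 3 deliberately avoid ever selecting an epoch based on the test error itself.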

I re-ran the code.
1. -shareGradInput true -optnet true
CUDA_VISIBLE_DEVICES=0,1,2,3 th main.lua -dataset cifar10 -nGPU 4 -shareGradInput true -optnet true -testOnly true -retrain ./checkpoints/11-7/11-7d/model_best.t7
The result was the error: cannot use both -shareGradInput and -optnet. That was reasonable.

2. -shareGradInput false -optnet false
CUDA_VISIBLE_DEVICES=0,1,2,3 th main.lua -dataset cifar10 -nGPU 4 -shareGradInput false -optnet false -testOnly true -retrain ./checkpoints/11-7/11-7d/model_best.t7
The result was Results top1: 3.670 top5: 0.020.

3. -shareGradInput true -optnet false
The result was Results top1: 3.670 top5: 0.020.

4. -shareGradInput false -optnet true
The result was top1: 5.030 top5: 0.110.
However, this top1 of 5.03 was obviously worse than expected.

My training code was CUDA_VISIBLE_DEVICES=0,1,2,3 th main.lua -dataset cifar10 -nGPU 4 -batchSize 128 -depth 26 -shareGradInput false -optnet true -nEpochs 400 -netType shakeshake -lrShape cosine -widenFactor 2 -LR 0.1 -forwardShake true -backwardShake true -shakeImage true.

Q1: Why was the top1 of 3.67 still lower than every test top1 in the log file?
Q2: Do you think the expected top1 error with -testOnly true -shareGradInput false -optnet false should be 3.74?

Thanks again for your reply and for sharing your code. I am trying to improve it; I have now achieved a test top1 of 2.80 at epoch 900.

You should try to test the saved model with CUDA_VISIBLE_DEVICES=0,1,2,3 th main.lua -dataset cifar10 -nGPU 4 -shareGradInput false -optnet false -testOnly true -retrain ./checkpoints/11-7/11-7d/model_best.t7
If the results are still different from the test error at the last epoch, try with only 1 GPU, e.g. CUDA_VISIBLE_DEVICES=0 th main.lua -dataset cifar10 -nGPU 1 -shareGradInput false -optnet false -testOnly true -retrain ./checkpoints/11-7/11-7d/model_best.t7

Regarding the code you are using, make sure you have the latest version from GitHub; it could be that you are using an older version that did not save the last model. With my latest code, your saved model should appear as model_400.t7 instead of model_best.t7.

In any case, the right error is the one in the log file at the last epoch.
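As a concrete illustration of "take the last epoch's error from the log file", here is a minimal Python sketch. The log line format below is invented for the example and will differ from the actual Torch output, so the regex would need adjusting for a real log.

```python
import re

# A fake training log with one summary line per epoch (format is an
# assumption made for this example, not the repo's real log format).
log = """\
 * Finished epoch 398     top1:   3.790  top5:   0.030
 * Finished epoch 399     top1:   3.760  top5:   0.020
 * Finished epoch 400     top1:   3.740  top5:   0.020
"""

# Collect every per-epoch top1 value, then keep the last one,
# i.e. the test error at the final epoch.
matches = re.findall(r"top1:\s*([\d.]+)", log)
last_epoch_top1 = float(matches[-1])
print(last_epoch_top1)  # 3.74
```

Parsing the log this way avoids re-running -testOnly entirely, sidestepping the -shareGradInput/-optnet interaction discussed above.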

Thanks for your reply.
I tried your suggestion of using nGPU 1. The result with nGPU 1 was the same as with nGPU 4.

In any case, I will ignore this problem and use the test top1 of the last epoch.
You can close this issue; I will open a new issue for my other question.
Thanks again.