XPixelGroup/HAT

Number of training iterations?

yumath opened this issue · 4 comments

yumath commented

The options file sets 500k iterations (total_iter). With a batch size of 4, BasicSR converts the total iteration count into an epoch count for training: https://github.com/XPixelGroup/BasicSR/blob/033cd6896d898fdd3dcda32e3102a792efa1b8f4/basicsr/train.py#L48

Taking the DF2K dataset as an example:
Training statistics:
Number of train images: 144147
Dataset enlarge ratio: 1
Batch size per gpu: 4
World size (gpu number): 1
Require iter number per epoch: 36037
Total epochs: 14; iters: 500000.
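
For reference, the epoch count above seems to come from a computation like this in the linked train.py (my paraphrase of the BasicSR code, not the verbatim source):

```python
import math

# Values taken from the training statistics above (DF2K).
num_train_images = 144147
dataset_enlarge_ratio = 1       # repeats the dataset within one epoch
batch_size_per_gpu = 4
world_size = 1                  # number of GPUs (processes) in distributed training
total_iters = 500_000           # train.total_iter from the options file

# Iterations needed to sweep the (enlarged) dataset once:
num_iter_per_epoch = math.ceil(
    num_train_images * dataset_enlarge_ratio / (batch_size_per_gpu * world_size))
# Epochs needed to reach the requested total iterations:
total_epochs = math.ceil(total_iters / num_iter_per_epoch)

print(num_iter_per_epoch, total_epochs)  # 36037, 14 -- matches the statistics above
```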

However, when the 14 epochs finished, the iteration count had only reached 135,100, with the eta still showing more than 3 days; this is far short of 500k iterations:

[train..][epoch: 14, iter: 135,000, lr:(2.000e-04,)] [eta: 3 days, 16:37:09, time (data): 0.863 (0.002)] l_pix: 1.3308e-02
[train..][epoch: 14, iter: 135,100, lr:(2.000e-04,)] [eta: 3 days, 16:35:37, time (data): 0.861 (0.002)] l_pix: 1.2130e-02
End of training. Time consumed: xxx
Save the latest model.

In this situation, did my training actually complete, or did it stop short of 500k iterations?
The results I am getting are also very different from those reported in the paper.
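
A quick sanity check (assuming BasicSR counts epochs from 0, so "epoch: 14" means 15 completed epochs):

```python
# Back-of-envelope check; the epoch-from-0 convention is my assumption.
iters_done = 135_100          # last logged iteration
epochs_done = 15              # epochs 0..14
print(iters_done / epochs_done)   # ~9007 iters per epoch actually executed,
                                  # not the 36037 printed in the training statistics
```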

yumath commented

To put it simply: 500k training iterations were configured, but training ended after only 135k. Is this normal?

yumath commented

Do I need to increase dataset_enlarge_ratio?

#26 (comment) suggests setting a small batch size.

Should I set batch_size to 1?

Ok, I found the bug. https://github.com/XPixelGroup/HAT#how-to-train says to call hat/train.py with a distributed launcher, but I had not done that, so opt['world_size'] was set to 1. This caused a mismatch between the iteration and epoch counts, resulting in insufficient training.
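
For anyone hitting the same thing, here is a sketch of how I understand the mismatch. The numbers are illustrative (num_gpu = 4 is hypothetical; use whatever your yml sets), the DataParallel batch-size multiplier in BasicSR's build_dataloader is my reading of the code rather than something confirmed here, and I assume the training loop runs epochs 0 through total_epochs inclusive:

```python
import math

# Hypothetical setup for illustration only.
num_train_images = 144147
batch_size_per_gpu = 4
num_gpu = 4                 # assumption: num_gpu from the yml options
total_iters = 500_000

# Without a distributed launcher, opt['world_size'] stays 1, so the epoch
# budget is computed as if one process consumed 4 samples per iteration:
world_size = 1
iters_per_epoch_assumed = math.ceil(
    num_train_images / (batch_size_per_gpu * world_size))        # 36037
total_epochs = math.ceil(total_iters / iters_per_epoch_assumed)  # 14

# But in non-distributed (DataParallel) mode, BasicSR's build_dataloader -- to
# my understanding -- multiplies the batch size by num_gpu, so each epoch
# actually yields far fewer iterations:
iters_per_epoch_actual = math.ceil(
    num_train_images / (batch_size_per_gpu * num_gpu))           # ~9010

# Assuming epochs 0..total_epochs inclusive:
print((total_epochs + 1) * iters_per_epoch_actual)  # ~135k -- roughly where my run stopped
```

With the distributed launch command from the README, opt['world_size'] equals the number of launched processes, the epoch budget and the actual dataloader agree, and training runs for the full total_iter.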