Add a rank <= 0 check when logging validation results
Scallions opened this issue · 13 comments
Thanks for your awesome work. When I run train.py with multiple GPUs, I find that the tb_logger has a bug when rank > 0.
Here, rank is checked:
train.py#L125
if rank <= 0:
    logger.info('Number of val images in [{:s}]: {:d}'.format(
        dataset_opt['name'], len(val_set)))
But it is not checked here:
train.py#L215
# tensorboard logger
if opt['use_tb_logger'] and 'debug' not in opt['name']:
    tb_logger.add_scalar('psnr', avg_psnr, current_step)
It should be changed to:
# tensorboard logger
if rank <= 0 and opt['use_tb_logger'] and 'debug' not in opt['name']:
    tb_logger.add_scalar('psnr', avg_psnr, current_step)
train_ClassSR.py has the same bug.
Thank you for your reminder~ I will fix them.
Validation always runs on a single card, so checking the rank there is optional.
https://github.com/Xiangtaokong/ClassSR/blob/8410d47e18f1371b9567fb224e83537c51d60cc9/codes/train.py#L62-L84
The tb_logger is only created when rank <= 0, but in the code below, tb_logger will be undefined when rank > 0:
https://github.com/Xiangtaokong/ClassSR/blob/8410d47e18f1371b9567fb224e83537c51d60cc9/codes/train.py#L178-L217
You are right (notice the comment: ### validation # does not support multi-GPU validation): validation always runs on a single card, so rank must be <= 0 during validation (in the code below).
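For reference, a minimal sketch of the pattern being discussed (simplified, not the exact repo code; it assumes the torch.utils.tensorboard backend, and rank, avg_psnr, and current_step are placeholders): the writer only exists on rank 0, so every later write has to repeat the same rank check, otherwise the name is undefined on the other ranks.
from torch.utils.tensorboard import SummaryWriter  # the repo may use a different backend (e.g. tensorboardX)

rank = 0  # set by the distributed launcher in the real train.py (-1 or 0 for single GPU)

# the writer is only created on rank <= 0, so it is never defined on ranks > 0
if rank <= 0:
    tb_logger = SummaryWriter(log_dir='../tb_logger/example')

# ... training / validation loop ...
avg_psnr, current_step = 30.0, 1000  # placeholder values for this sketch

# every later use must repeat the rank check, otherwise ranks > 0 raise NameError
if rank <= 0:
    tb_logger.add_scalar('psnr', avg_psnr, current_step)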
I have another question about the number of iterations used for pretraining the SR module, training the classifier, and finetuning. I can't find it in the paper.
4.1.3 Training Details
Line8 -- pretrain : 500K (original network is 1000K)
Line13 -- Class Module : within 200K
finetune: 500K, see Figure 6
Other details are in the code setting files (yml); following the default settings is OK.
I have a new problem. When I train the SR module with 64 features, the loss suddenly becomes large at around 100k iterations. I found that others had the same problem with RCAN, so I want to ask you for some advice. Thanks!
Generally, this problem occurs only when training RCAN, because RCAN is large and complex and crashes easily.
These are my training curves of RCAN:
But you can still continue training from the last iterations before the failure (using the saved xxx.state), or even use a script that automatically resumes training whenever it fails, which will get past the failure. A hypothetical wrapper is sketched below.
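A rough, hypothetical sketch of such an auto-resume wrapper (not repo code: the -opt flag and the experiments/<name>/training_state/*.state layout follow the repo's conventions, but OPT_FILE, STATE_DIR, and the naive yml rewrite below are my own assumptions, so adjust them to your setup):
import glob
import os
import re
import subprocess
import time

OPT_FILE = 'options/train/train_SR.yml'               # path to your training config (assumed)
STATE_DIR = '../experiments/example/training_state'   # where .state checkpoints are saved (assumed)

def latest_state():
    # newest .state checkpoint, or None if training has not saved one yet
    states = glob.glob(os.path.join(STATE_DIR, '*.state'))
    return max(states, key=os.path.getmtime) if states else None

def point_resume_at(state_path):
    # naive rewrite of the resume_state line in the yml; a YAML library would be safer
    with open(OPT_FILE) as f:
        text = f.read()
    text = re.sub(r'resume_state:.*', 'resume_state: ' + state_path, text)
    with open(OPT_FILE, 'w') as f:
        f.write(text)

while True:
    ret = subprocess.call(['python', 'train.py', '-opt', OPT_FILE])
    if ret == 0:
        break          # training finished normally
    state = latest_state()
    if state is None:
        break          # crashed before any checkpoint was written
    point_resume_at(state)
    time.sleep(10)     # brief pause before relaunching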
Finally, I suggest you use small branches such as FSRCNN; their training will be much faster and easier.
Thanks for your timely reply. I will try it.
Sorry, it's a typo… In train_ClassSR_RCAN.yml, change train: l1w: 250 to l1w: 4. I have updated the file.
PS:
The reason is that the original RCAN does not normalize pixel values to the [0, 1] range during training, while the other networks all work in [0, 1]. We followed the original, so the l1_loss value is about 255 times that of the other networks, and l1w should therefore be 1000/255 ≈ 4 (which is also what we actually used in training). But when restructuring the GitHub code I carelessly thought of it as 1000/4, i.e. 250… I have already contacted the competition organizers.
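(For intuition, a tiny runnable sketch of the scaling argument above, using random placeholder tensors: the same image pair gives an L1 loss about 255 times larger when pixels live in [0, 255] instead of [0, 1], so a weight of 1000 for the [0, 1] networks corresponds to roughly 1000/255 ≈ 4 for RCAN.)
import torch
import torch.nn.functional as F

pred = torch.rand(8, 3, 32, 32)      # placeholder prediction, pixels in [0, 1]
target = torch.rand(8, 3, 32, 32)    # placeholder ground truth, pixels in [0, 1]

l1_01 = F.l1_loss(pred, target)                   # other networks: [0, 1] range
l1_255 = F.l1_loss(pred * 255, target * 255)      # RCAN convention: [0, 255] range

print(l1_255 / l1_01)   # ~255, hence l1w: 1000 / 255 ≈ 4 instead of 1000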