Add a rank <= 0 check when logging validation results
Scallions opened this issue · 13 comments
Thanks for your awesome work. When I run train.py with multiple GPUs, I find that the tb_logger has a bug when rank > 0.
Here, rank is checked:
train.py#L125
if rank <= 0:
    logger.info('Number of val images in [{:s}]: {:d}'.format(
        dataset_opt['name'], len(val_set)))
But it is not checked here:
train.py#L215
# tensorboard logger
if opt['use_tb_logger'] and 'debug' not in opt['name']:
    tb_logger.add_scalar('psnr', avg_psnr, current_step)
It should be changed to:
# tensorboard logger
if rank <= 0 and opt['use_tb_logger'] and 'debug' not in opt['name']:
    tb_logger.add_scalar('psnr', avg_psnr, current_step)
train_ClassSR.py has the same bug.
Thank you for your reminder~ I will fix them.
Validation always runs on a single card, so checking the rank there is optional.
https://github.com/Xiangtaokong/ClassSR/blob/8410d47e18f1371b9567fb224e83537c51d60cc9/codes/train.py#L62-L84
The tb_logger is only created when rank <= 0, but in the code below, tb_logger will be undefined when rank > 0:
https://github.com/Xiangtaokong/ClassSR/blob/8410d47e18f1371b9567fb224e83537c51d60cc9/codes/train.py#L178-L217
You are right (notice the comment: ### validation # does not support multi-GPU validation): validation always runs on a single card, so rank must be <= 0 during validation (in the code below).
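For reference, a minimal sketch of the pattern being discussed (simplified, not the exact repo code; it assumes the torch.utils.tensorboard backend, and rank, avg_psnr, and current_step are placeholders): the writer only exists on rank 0, so every later write has to repeat the same rank check, otherwise the name is undefined on the other ranks.
from torch.utils.tensorboard import SummaryWriter  # the repo may use a different backend (e.g. tensorboardX)

rank = 0  # set by the distributed launcher in the real train.py (-1 or 0 for single GPU)

# the writer is only created on rank <= 0, so it is never defined on ranks > 0
if rank <= 0:
    tb_logger = SummaryWriter(log_dir='../tb_logger/example')

# ... training / validation loop ...
avg_psnr, current_step = 30.0, 1000  # placeholder values for this sketch

# every later use must repeat the rank check, otherwise ranks > 0 raise NameError
if rank <= 0:
    tb_logger.add_scalar('psnr', avg_psnr, current_step)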
I have another question about the number of iterations used for pretraining the SR module, training the classifier, and finetuning. I can't find it in the paper.
4.1.3 Training Details
Line8 -- pretrain : 500K (original network is 1000K)
Line13 -- Class Module : within 200K
finetune: 500K, see Figure 6
Other details are in the code setting files (yml); following the default settings is OK.
I have a new problem. When I train the SR module with 64 features, the loss suddenly becomes large at around 100k iterations. I found that others had the same problem with RCAN, so I want to ask you for some advice. Thanks!
Generally, this problem occurs only when training RCAN, because RCAN is large and complex and crashes easily.
These are my training curves of RCAN:
But you can still continue training from the last iterations before the failure (using the saved xxx.state), or even use a script that automatically resumes training whenever it fails, which will get past the failure. A hypothetical wrapper is sketched below.
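A rough, hypothetical sketch of such an auto-resume wrapper (not repo code: the -opt flag and the experiments/<name>/training_state/*.state layout follow the repo's conventions, but OPT_FILE, STATE_DIR, and the naive yml rewrite below are my own assumptions, so adjust them to your setup):
import glob
import os
import re
import subprocess
import time

OPT_FILE = 'options/train/train_SR.yml'               # path to your training config (assumed)
STATE_DIR = '../experiments/example/training_state'   # where .state checkpoints are saved (assumed)

def latest_state():
    # newest .state checkpoint, or None if training has not saved one yet
    states = glob.glob(os.path.join(STATE_DIR, '*.state'))
    return max(states, key=os.path.getmtime) if states else None

def point_resume_at(state_path):
    # naive rewrite of the resume_state line in the yml; a YAML library would be safer
    with open(OPT_FILE) as f:
        text = f.read()
    text = re.sub(r'resume_state:.*', 'resume_state: ' + state_path, text)
    with open(OPT_FILE, 'w') as f:
        f.write(text)

while True:
    ret = subprocess.call(['python', 'train.py', '-opt', OPT_FILE])
    if ret == 0:
        break          # training finished normally
    state = latest_state()
    if state is None:
        break          # crashed before any checkpoint was written
    point_resume_at(state)
    time.sleep(10)     # brief pause before relaunching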
Finally, I suggest you use small branches such as FSRCNN; their training will be much faster and easier.
Thanks for your timely reply. I will try it.
Sorry, it's a typo… In train_ClassSR_RCAN.yml, change train: l1w: 250 to l1w: 4. I have updated the file.
PS:
The reason is that the original RCAN does not normalize pixel values to the [0, 1] range during training, while the other networks all work in [0, 1]. We followed the original, so the l1_loss value is about 255 times that of the other networks, and l1w should therefore be 1000/255 ≈ 4 (which is also what we actually used in training). But when restructuring the GitHub code I carelessly thought of it as 1000/4, i.e. 250… I have already contacted the competition organizers.
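(For intuition, a tiny runnable sketch of the scaling argument above, using random placeholder tensors: the same image pair gives an L1 loss about 255 times larger when pixels live in [0, 255] instead of [0, 1], so a weight of 1000 for the [0, 1] networks corresponds to roughly 1000/255 ≈ 4 for RCAN.)
import torch
import torch.nn.functional as F

pred = torch.rand(8, 3, 32, 32)      # placeholder prediction, pixels in [0, 1]
target = torch.rand(8, 3, 32, 32)    # placeholder ground truth, pixels in [0, 1]

l1_01 = F.l1_loss(pred, target)                   # other networks: [0, 1] range
l1_255 = F.l1_loss(pred * 255, target * 255)      # RCAN convention: [0, 255] range

print(l1_255 / l1_01)   # ~255, hence l1w: 1000 / 255 ≈ 4 instead of 1000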