ChenHongruixuan/MambaCD

Question about batch size, iters, and epochs

absqh opened this issue · 8 comments

I'd like to ask what hardware the authors used for the experiments. Running the SYSU dataset on the BCD task with a single 4090, I can only set the batch size to 8. Should max_iters then be changed to 640000 so that the number of epochs matches (the paper sets the training iters to 20000, and the command line in the README uses bs=16 with max_iters=320000)? Or do I only need to match the paper's iters=20000, in which case I would set max_iters=40000, which by comparison would cut the computational cost by more than a factor of ten?

Hi,

Thank you so much for your question! There is no need to change max_iters; just keep it at 320000. The actual number of iterations will be max_iters / batch_size.

Best,

Thanks for your reply, but I'm a bit confused. If I want to reproduce the experiments in your paper, shouldn't the number of epochs be aligned? Or do I only need to align the actual number of iterations? If I keep 320000 unchanged, won't both of those change?

Thanks for your question. The number of epochs stays aligned precisely because max_iters is kept constant; you can verify this with a quick calculation.

For example, in my case the batch size is 16, so the number of iterations is 320000 / 16 = 20000. In your case it is 320000 / 8 = 40000. Although the final number of iterations differs, the network sees the same number of samples in both cases, which is the underlying rationale of keeping the number of epochs consistent.
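To make the bookkeeping concrete, here is a minimal sketch of the arithmetic (the dataset_size value is a made-up placeholder, not the actual size of SYSU):

```python
max_iters = 320_000      # total training samples to draw, fixed across runs
dataset_size = 20_000    # hypothetical size of the original data list

for batch_size in (8, 16):
    updates = max_iters // batch_size   # parameter updates (loop iterations)
    epochs = max_iters / dataset_size   # passes over the original dataset
    print(f"bs={batch_size}: {updates} updates, {epochs:.0f} epochs, "
          f"{max_iters} samples seen")
```

Both settings draw 320000 samples (16 epochs with these placeholder numbers); only the number of updates changes.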

Best,

Thank you, I understand now; I hadn't worked out this logic before. Could you point me to where the formula 320000/16 = 20000 is implemented in the code?

Glad to hear that. Please refer to the dataset/dataloader code:

```python
if max_iters is not None:
    # Repeat the data list until it covers at least max_iters entries,
    # then truncate, so one pass over the loader yields exactly max_iters samples.
    self.data_list = self.data_list * int(np.ceil(float(max_iters) / len(self.data_list)))
    self.data_list = self.data_list[0:max_iters]
```
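As a standalone illustration of what that repeat-and-truncate does (the helper name and toy file list below are hypothetical, not the actual dataset class):

```python
import numpy as np

def expand_to_max_iters(data_list, max_iters):
    # Tile the list until it is at least max_iters long, then cut it to
    # exactly max_iters entries -- the same logic as the snippet above.
    data_list = data_list * int(np.ceil(float(max_iters) / len(data_list)))
    return data_list[:max_iters]

files = ["a.png", "b.png", "c.png"]   # toy data list
expanded = expand_to_max_iters(files, 8)
print(expanded)       # ['a.png', 'b.png', 'c.png', 'a.png', 'b.png', 'c.png', 'a.png', 'b.png']
print(len(expanded))  # 8
```

Since the expanded list holds exactly max_iters samples, the number of batches per run is max_iters / batch_size, which is where 320000 / 16 = 20000 comes from.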


Very nice work. However, I am still confused about your statement that 'the network sees the same number of samples in both cases'. When max_iters=320000 and batch_size=8, there are 40000 iterations, and each iteration performs one parameter update, i.e., 40000 updates. With batch_size=16, the parameters are updated 20000 times. Are these two consistent?

Hello, and thank you for pointing this out! You are correct: from the point of view of the sample information, they are the same. There may be some differences once you take the adaptive optimizer into account, since the number of parameter updates differs, but according to our experiments the final accuracy does not differ much.

Thank you for your reply; it has given me a new understanding. Once again, congratulations on your work, which I will keep following and exploring.