Question about the number of training epochs

Question

Closed this issue 2 years ago · 7 comments

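I noticed that you state the SSD network is trained for 300 epochs. Because I am short on time, I would like to train for fewer epochs, for example 100. Where do I change this?
In your code I found epoch=2 and epoch_ratio = [5, 1], but no other epoch-related settings.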

Answer 1 · 2022-08-02T00:51:52.000Z

Thanks for the amazing work. What is the setting for the random selection baseline, which does not use IUL and IUR? I am trying to reproduce that experiment. Thanks for your help.

Answer 2 · 2022-08-09T13:37:34.000Z

The 300-epoch setting corresponds to the SSD detector; here is the link to its configuration file.

The total number of epochs is determined by the following variables: epoch_ratio on line 82, epoch on line 86, X_L_repeat on line 89, and X_U_repeat on line 90. The one or two comment lines above each variable explain its meaning.

For example, a total of 300 epochs comes from (5+(1+1+5)*2)*16=304≈300.

You can decide which variable is most appropriate to modify. I recommend changing only X_L_repeat and X_U_repeat, keeping the two values in sync.
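As a rough illustration of the arithmetic above, the sketch below reproduces 304 and then searches for a repeat value that lands near 100 epochs. It assumes that the factor 16 is the shared value of X_L_repeat and X_U_repeat; that reading is inferred from the numbers in this thread, not confirmed against the configuration file. The function names and the search range are hypothetical.

```python
def presented_epochs(epoch_ratio, epoch):
    """Epochs counted before any data repetition, read off 5 + (1 + 1 + 5) * 2."""
    first, second = epoch_ratio
    # One leading stage of `first` epochs, then `epoch` rounds that each add
    # two stages of `second` epochs and one stage of `first` epochs.
    return first + (2 * second + first) * epoch


def equivalent_epochs(epoch_ratio, epoch, repeat):
    """Equivalent epochs when X_L_repeat == X_U_repeat == repeat (assumption)."""
    return presented_epochs(epoch_ratio, epoch) * repeat


# Values quoted in this thread: epoch_ratio = [5, 1], epoch = 2, factor 16.
ssd = equivalent_epochs(epoch_ratio=[5, 1], epoch=2, repeat=16)
print(ssd)  # 304

# Pick the shared repeat value whose total is closest to the target.
target = 100
best = min(range(1, 17),
           key=lambda r: abs(equivalent_epochs([5, 1], 2, r) - target))
print(best, equivalent_epochs([5, 1], 2, best))  # 5 95
```

Under these assumptions, setting both repeat variables to 5 gives 95 epochs, the closest match to the requested 100.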

Answer 3 · 2022-08-09T13:45:22.000Z

Hello, regarding the random selection baseline without IUL and IUR: you can refer to the modification here.

Answer 4 · 2022-08-09T15:29:57.000Z

OK, thank you for your reply. I will try to modify that part.

Answer 5 · 2022-09-08T02:33:55.000Z

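Hello! I have studied your code and would like to confirm whether my understanding of the training structure and parameters is correct:
epoch: the number of outer epochs. One "outer epoch" consists of "train the whole model -> two rounds of re-weighting and min/max"; in the first epoch, the whole model has to be trained first.
epoch_ratio: this array has the structure [number of epochs for training the whole model, number of epochs for re-weighting and min/max].
cycle: this array represents the process from the initially selected data to the final dataset; consecutive cycles differ by one increment of the dataset proportion.
So should the total number of epochs be calculated as (epoch_ratio[0]+(epoch_ratio[0]+epoch_ratio[1]*2)*epoch)*len(cycle)? While reading the code I could not find where X_U_repeat and X_L_repeat are used (the two values are assigned to cfg.data.train.times, but I could not find where that is used afterwards). Could you explain the design logic of these two variables?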

Answer 6 · 2022-09-15T09:16:08.000Z

Hello! Thank you for your interest in our work. Your understanding of the training structure and parameters is correct.

X_U_repeat and X_L_repeat are indeed assigned to cfg.data.train.times. You can refer to here for where it is used; it is the number of times the data is repeated in the dataloader.

This design follows the configuration files that ship with MMDetection. According to the explanation here, the actual equivalent number of epochs equals the total number of epochs you calculated above multiplied by the number of repetitions of the corresponding data.
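To make the effect of a repeat count concrete, here is a minimal sketch of a repeat-style dataset wrapper. It is written in the spirit of MMDetection's RepeatDataset wrapper but is not its source code; the class name RepeatedView and the sample data are hypothetical.

```python
class RepeatedView:
    """Hypothetical stand-in for a repeat-style dataset wrapper."""

    def __init__(self, dataset, times):
        self.dataset = dataset
        self.times = times
        self._ori_len = len(dataset)

    def __len__(self):
        # One pass over the wrapper visits every sample `times` times.
        return self.times * self._ori_len

    def __getitem__(self, idx):
        if not 0 <= idx < len(self):
            raise IndexError(idx)
        # Wrap around so indices past the original length reuse the samples.
        return self.dataset[idx % self._ori_len]


samples = ["img_a", "img_b", "img_c"]
repeated = RepeatedView(samples, times=2)

# Three nominal epochs over the wrapper equal six passes over the real data.
nominal_epochs = 3
passes_over_data = nominal_epochs * len(repeated) // len(samples)
print(len(repeated), passes_over_data)  # 6 6
```

This is why a nominal epoch count has to be multiplied by the repeat count to get the equivalent number of passes over the data.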

In our repository, the actual number of epochs is therefore the presented number you calculated with each term multiplied by X_U_repeat or X_L_repeat accordingly, that is, (epoch_ratio[0]*X_L_repeat+(epoch_ratio[1]*X_U_repeat*2+epoch_ratio[0]*X_L_repeat)*epoch)*len(cycle). With the default values this is (3*2+(1*2*2+3*2)*2)*7=182.
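The formula can be checked with a few lines of Python. This is only a sketch: the function and argument names are hypothetical, and the default values are read off the substitution (3*2+(1*2*2+3*2)*2)*7 quoted in this reply rather than taken from the configuration file.

```python
def actual_epochs(epoch_ratio, epoch, x_l_repeat, x_u_repeat, num_cycles):
    """Evaluate the formula from this thread for the equivalent epoch count."""
    first = epoch_ratio[0] * x_l_repeat   # epoch_ratio[0] * X_L_repeat
    second = epoch_ratio[1] * x_u_repeat  # epoch_ratio[1] * X_U_repeat
    return (first + (second * 2 + first) * epoch) * num_cycles


# Defaults as substituted in the reply: epoch_ratio = [3, 1], epoch = 2,
# X_L_repeat = X_U_repeat = 2, len(cycle) = 7.
default_total = actual_epochs([3, 1], epoch=2, x_l_repeat=2, x_u_repeat=2,
                              num_cycles=7)
print(default_total)  # 182

# With both repeats set to 1 the result is the presented (unrepeated) count.
print(actual_epochs([3, 1], epoch=2, x_l_repeat=1, x_u_repeat=1,
                    num_cycles=7))  # 91
```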

Answer 7 · 2022-09-20T02:45:26.000Z

Got it, thank you for the explanation.