Training the large model
zhangcj13 opened this issue · 6 comments
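A question: to train the large model, can I simply replace the small model in the training code with the network structure from the DBFace test code?
When I tried training with the dbface-large model, I got: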
2020-05-20 11:48:09,718 - INFO - train-large.py[:233] - iter: 871, lr: 0.000125, epoch: 6.44, loss: 5.04, hm_loss: 0.36, box_loss: 4.11, lmdk_loss: 0.56668
avg is zero
avg is zero
2020-05-20 11:49:07,651 - INFO - train-large.py[:233] - iter: 881, lr: 0.000125, epoch: 6.52, loss: 4.25, hm_loss: 0.04, box_loss: 4.21, lmdk_loss: 0.00000
avg is zero
2020-05-20 11:50:07,313 - INFO - train-large.py[:233] - iter: 891, lr: 0.000125, epoch: 6.59, loss: 2.70, hm_loss: 0.57, box_loss: 1.89, lmdk_loss: 0.24474
2020-05-20 11:51:05,086 - INFO - train-large.py[:233] - iter: 901, lr: 0.000125, epoch: 6.67, loss: 2.58, hm_loss: 0.05, box_loss: 2.51, lmdk_loss: 0.02192
2020-05-20 11:52:11,552 - INFO - train-large.py[:233] - iter: 911, lr: 0.000125, epoch: 6.74, loss: 1.40, hm_loss: 0.03, box_loss: 1.28, lmdk_loss: 0.09160
The landmark loss drops to 0, and I don't know what "avg is zero" means. Is it normal for the loss to oscillate this much?
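That only means the current iteration has no landmarks to train on. You can ignore the message; you can find it in losses.py.
It mostly happens in cases such as small faces that have no landmark annotations. You can disregard the zero losses and watch only the non-zero values.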
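It looks like "avg is zero" comes from GIoULoss rather than the landmark loss. Later in training it is all I get, and the model goes wrong too. I'll train some more and see.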
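For readers wondering how a message like "avg is zero" can arise, here is a minimal sketch of a weighted loss average that has nothing to average over. It is not the code from losses.py; the function name `weighted_mean` and its arguments are hypothetical, and it only assumes that the loss is normalized by the sum of per-sample weights (for example, a mask of valid targets).

```python
def weighted_mean(values, weights):
    """Average `values` using `weights`; hypothetical helper, not from the repo.

    If every weight is zero (no valid targets in the batch), there is
    nothing to average, so print a notice and return 0.0 instead of
    dividing by zero.
    """
    total_weight = sum(weights)
    if total_weight == 0:
        print("avg is zero")
        return 0.0
    weighted_sum = sum(v * w for v, w in zip(values, weights))
    return weighted_sum / total_weight


# A batch with two valid targets, then a batch with none.
normal = weighted_mean([1.0, 3.0], [1, 1])
empty = weighted_mean([1.0, 3.0], [0, 0])
```

Under this assumption, an occasional notice is harmless, but a notice on every iteration would mean that no batch contains any valid target, which points at the data pipeline rather than at the loss.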
2020-05-22 05:23:14,842 - INFO - train-mobilev2DBface.py[:231] - iter: 18952, lr: 4e-07, epoch: 47.02, loss: 0.01, hm_loss: 0.01, box_loss: 0.00, lmdk_loss: 0.00000
avg is zero
avg is zero
avg is zero
avg is zero
avg is zero
avg is zero
avg is zero
avg is zero
avg is zero
avg is zero
2020-05-22 05:23:35,477 - INFO - train-mobilev2DBface.py[:231] - iter: 18962, lr: 4e-07, epoch: 47.05, loss: 0.02, hm_loss: 0.02, box_loss: 0.00, lmdk_loss: 0.00000
avg is zero
avg is zero
avg is zero
avg is zero
avg is zero
avg is zero
avg is zero
avg is zero
avg is zero
avg is zero
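Later in training everything looks like this: there are no targets left to train on and the model can no longer update. Could the data loading be going wrong at that point?
Found the problem: it was the data-loading thread setting.
Hi, I'm running into the same problem. How did you fix the "data-loading thread setting"? Please share, thanks. @zhangcj13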
@qinxianyuzi
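DataLoader(dataset=self.train_dataset, batch_size=self.batch_size, shuffle=True, num_workers=4). It's the num_workers argument; I think the default is 16, and my machine doesn't have that many cores. I first changed it to 0 and that caused the problem; setting it to a value the machine can handle fixed it. Not sure whether that's your issue too.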
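One way to avoid hard-coding the worker count is to clamp it to the number of cores the machine reports. The sketch below is only an illustration: `pick_num_workers` is a hypothetical helper, not part of the training code, and the lower bound of 1 simply reflects the report above that 0 caused trouble in this setup.

```python
import os


def pick_num_workers(requested=16, available=None):
    """Clamp the requested worker count to the cores the machine has.

    `available` defaults to os.cpu_count(); the result is never below 1,
    an assumption based on the report that 0 workers misbehaved here.
    """
    if available is None:
        # os.cpu_count() may return None on some platforms.
        available = os.cpu_count() or 1
    return max(1, min(requested, available))


# On a 4-core machine a request for 16 workers is reduced to 4.
workers_on_quad_core = pick_num_workers(16, available=4)
```

The result can then be passed as `num_workers=pick_num_workers(16)` in the `DataLoader(...)` call quoted above.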