Training accuracy won't improve
ai1361720220000 opened this issue · 7 comments
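Hello, I switched the dataset to a classification problem of my own and changed num_classes accordingly.
However, within the first few epochs of training the model quickly reaches about 81% and then stops changing. Why?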
[INFO: 2021-08-06 03:11:28,969] Epoch: 1, top1_acc = 81.50%, top5_acc = 88.03% in 1011
[INFO: 2021-08-06 03:11:50,782] Epoch: 1, top1_acc = 78.34%, top5_acc = 90.80% in 1011
[INFO: 2021-08-06 03:11:54,979] Current checkpoints:
('./cot_experiments/CoTNet_SYS/snapshot/checkpoint-1.pth.tar', 0.7833827893175074)
[INFO: 2021-08-06 03:12:06,148] Distributing BatchNorm running means and vars
[INFO: 2021-08-06 03:12:29,904] Epoch: 2, top1_acc = 81.50%, top5_acc = 91.30% in 1011
[INFO: 2021-08-06 03:12:52,995] Epoch: 2, top1_acc = 80.32%, top5_acc = 91.00% in 1011
[INFO: 2021-08-06 03:12:57,295] Current checkpoints:
('./cot_experiments/CoTNet_SYS/snapshot/checkpoint-2.pth.tar', 0.8031651829871415)
('./cot_experiments/CoTNet_SYS/snapshot/checkpoint-1.pth.tar', 0.7833827893175074)
[INFO: 2021-08-06 03:13:07,683] Distributing BatchNorm running means and vars
[INFO: 2021-08-06 03:13:30,486] Epoch: 3, top1_acc = 81.50%, top5_acc = 90.60% in 1011
[INFO: 2021-08-06 03:13:51,824] Epoch: 3, top1_acc = 81.40%, top5_acc = 91.10% in 1011
[INFO: 2021-08-06 03:13:55,495] Current checkpoints:
('./cot_experiments/CoTNet_SYS/snapshot/checkpoint-3.pth.tar', 0.8140454995054401)
('./cot_experiments/CoTNet_SYS/snapshot/checkpoint-2.pth.tar', 0.8031651829871415)
('./cot_experiments/CoTNet_SYS/snapshot/checkpoint-1.pth.tar', 0.7833827893175074)
[INFO: 2021-08-06 03:14:06,663] Distributing BatchNorm running means and vars
[INFO: 2021-08-06 03:14:28,298] Epoch: 4, top1_acc = 81.50%, top5_acc = 90.90% in 1011
[INFO: 2021-08-06 03:14:50,402] Epoch: 4, top1_acc = 81.50%, top5_acc = 91.00% in 1011
[INFO: 2021-08-06 03:14:54,367] Current checkpoints:
('./cot_experiments/CoTNet_SYS/snapshot/checkpoint-4.pth.tar', 0.8150346191889218)
('./cot_experiments/CoTNet_SYS/snapshot/checkpoint-3.pth.tar', 0.8140454995054401)
('./cot_experiments/CoTNet_SYS/snapshot/checkpoint-2.pth.tar', 0.8031651829871415)
('./cot_experiments/CoTNet_SYS/snapshot/checkpoint-1.pth.tar', 0.7833827893175074)
[INFO: 2021-08-06 03:15:04,976] Distributing BatchNorm running means and vars
[INFO: 2021-08-06 03:15:28,693] Epoch: 5, top1_acc = 81.50%, top5_acc = 90.60% in 1011
[INFO: 2021-08-06 03:15:50,798] Epoch: 5, top1_acc = 81.50%, top5_acc = 91.00% in 1011
[INFO: 2021-08-06 03:15:54,580] Current checkpoints:
('./cot_experiments/CoTNet_SYS/snapshot/checkpoint-4.pth.tar', 0.8150346191889218)
('./cot_experiments/CoTNet_SYS/snapshot/checkpoint-5.pth.tar', 0.8150346191889218)
('./cot_experiments/CoTNet_SYS/snapshot/checkpoint-3.pth.tar', 0.8140454995054401)
('./cot_experiments/CoTNet_SYS/snapshot/checkpoint-2.pth.tar', 0.8031651829871415)
('./cot_experiments/CoTNet_SYS/snapshot/checkpoint-1.pth.tar', 0.7833827893175074)
[INFO: 2021-08-06 03:16:05,745] Distributing BatchNorm running means and vars
[INFO: 2021-08-06 03:16:28,279] Epoch: 6, top1_acc = 81.50%, top5_acc = 90.60% in 1011
[INFO: 2021-08-06 03:16:50,315] Epoch: 6, top1_acc = 81.50%, top5_acc = 91.00% in 1011
[INFO: 2021-08-06 03:16:53,875] Current checkpoints:
('./cot_experiments/CoTNet_SYS/snapshot/checkpoint-4.pth.tar', 0.8150346191889218)
('./cot_experiments/CoTNet_SYS/snapshot/checkpoint-5.pth.tar', 0.8150346191889218)
('./cot_experiments/CoTNet_SYS/snapshot/checkpoint-6.pth.tar', 0.8150346191889218)
('./cot_experiments/CoTNet_SYS/snapshot/checkpoint-3.pth.tar', 0.8140454995054401)
('./cot_experiments/CoTNet_SYS/snapshot/checkpoint-2.pth.tar', 0.8031651829871415)
('./cot_experiments/CoTNet_SYS/snapshot/checkpoint-1.pth.tar', 0.7833827893175074)
[INFO: 2021-08-06 03:17:04,938] Distributing BatchNorm running means and vars
[INFO: 2021-08-06 03:17:25,901] Epoch: 7, top1_acc = 81.50%, top5_acc = 90.60% in 1011
[INFO: 2021-08-06 03:17:47,415] Epoch: 7, top1_acc = 81.50%, top5_acc = 91.00% in 1011
[INFO: 2021-08-06 03:17:51,371] Current checkpoints:
('./cot_experiments/CoTNet_SYS/snapshot/checkpoint-4.pth.tar', 0.8150346191889218)
('./cot_experiments/CoTNet_SYS/snapshot/checkpoint-5.pth.tar', 0.8150346191889218)
('./cot_experiments/CoTNet_SYS/snapshot/checkpoint-6.pth.tar', 0.8150346191889218)
('./cot_experiments/CoTNet_SYS/snapshot/checkpoint-7.pth.tar', 0.8150346191889218)
('./cot_experiments/CoTNet_SYS/snapshot/checkpoint-3.pth.tar', 0.8140454995054401)
('./cot_experiments/CoTNet_SYS/snapshot/checkpoint-2.pth.tar', 0.8031651829871415)
('./cot_experiments/CoTNet_SYS/snapshot/checkpoint-1.pth.tar', 0.7833827893175074)
[INFO: 2021-08-06 03:18:03,222] Distributing BatchNorm running means and vars
[INFO: 2021-08-06 03:18:25,004] Epoch: 8, top1_acc = 81.50%, top5_acc = 90.60% in 1011
[INFO: 2021-08-06 03:18:45,917] Epoch: 8, top1_acc = 81.50%, top5_acc = 91.00% in 1011
[INFO: 2021-08-06 03:18:49,683] Current checkpoints:
('./cot_experiments/CoTNet_SYS/snapshot/checkpoint-4.pth.tar', 0.8150346191889218)
('./cot_experiments/CoTNet_SYS/snapshot/checkpoint-5.pth.tar', 0.8150346191889218)
('./cot_experiments/CoTNet_SYS/snapshot/checkpoint-6.pth.tar', 0.8150346191889218)
('./cot_experiments/CoTNet_SYS/snapshot/checkpoint-7.pth.tar', 0.8150346191889218)
('./cot_experiments/CoTNet_SYS/snapshot/checkpoint-8.pth.tar', 0.8150346191889218)
('./cot_experiments/CoTNet_SYS/snapshot/checkpoint-3.pth.tar', 0.8140454995054401)
('./cot_experiments/CoTNet_SYS/snapshot/checkpoint-2.pth.tar', 0.8031651829871415)
('./cot_experiments/CoTNet_SYS/snapshot/checkpoint-1.pth.tar', 0.7833827893175074)
[INFO: 2021-08-06 03:19:00,600] Distributing BatchNorm running means and vars
[INFO: 2021-08-06 03:19:22,594] Epoch: 9, top1_acc = 81.50%, top5_acc = 90.60% in 1011
[INFO: 2021-08-06 03:19:44,611] Epoch: 9, top1_acc = 81.50%, top5_acc = 91.00% in 1011
[INFO: 2021-08-06 03:19:48,289] Current checkpoints:
('./cot_experiments/CoTNet_SYS/snapshot/checkpoint-4.pth.tar', 0.8150346191889218)
('./cot_experiments/CoTNet_SYS/snapshot/checkpoint-5.pth.tar', 0.8150346191889218)
('./cot_experiments/CoTNet_SYS/snapshot/checkpoint-6.pth.tar', 0.8150346191889218)
('./cot_experiments/CoTNet_SYS/snapshot/checkpoint-7.pth.tar', 0.8150346191889218)
('./cot_experiments/CoTNet_SYS/snapshot/checkpoint-8.pth.tar', 0.8150346191889218)
('./cot_experiments/CoTNet_SYS/snapshot/checkpoint-9.pth.tar', 0.8150346191889218)
('./cot_experiments/CoTNet_SYS/snapshot/checkpoint-3.pth.tar', 0.8140454995054401)
('./cot_experiments/CoTNet_SYS/snapshot/checkpoint-2.pth.tar', 0.8031651829871415)
('./cot_experiments/CoTNet_SYS/snapshot/checkpoint-1.pth.tar', 0.7833827893175074)
[INFO: 2021-08-06 03:19:59,657] Distributing BatchNorm running means and vars
[INFO: 2021-08-06 03:20:21,286] Epoch: 10, top1_acc = 81.50%, top5_acc = 90.60% in 1011
[INFO: 2021-08-06 03:20:42,669] Epoch: 10, top1_acc = 81.50%, top5_acc = 91.00% in 1011
[INFO: 2021-08-06 03:20:47,873] Current checkpoints:
('./cot_experiments/CoTNet_SYS/snapshot/checkpoint-4.pth.tar', 0.8150346191889218)
('./cot_experiments/CoTNet_SYS/snapshot/checkpoint-5.pth.tar', 0.8150346191889218)
('./cot_experiments/CoTNet_SYS/snapshot/checkpoint-6.pth.tar', 0.8150346191889218)
('./cot_experiments/CoTNet_SYS/snapshot/checkpoint-7.pth.tar', 0.8150346191889218)
('./cot_experiments/CoTNet_SYS/snapshot/checkpoint-8.pth.tar', 0.8150346191889218)
('./cot_experiments/CoTNet_SYS/snapshot/checkpoint-9.pth.tar', 0.8150346191889218)
('./cot_experiments/CoTNet_SYS/snapshot/checkpoint-10.pth.tar', 0.8150346191889218)
('./cot_experiments/CoTNet_SYS/snapshot/checkpoint-3.pth.tar', 0.8140454995054401)
('./cot_experiments/CoTNet_SYS/snapshot/checkpoint-2.pth.tar', 0.8031651829871415)
('./cot_experiments/CoTNet_SYS/snapshot/checkpoint-1.pth.tar', 0.7833827893175074)
[INFO: 2021-08-06 03:20:57,835] Distributing BatchNorm running means and vars
[INFO: 2021-08-06 03:21:19,902] Epoch: 11, top1_acc = 81.50%, top5_acc = 90.60% in 1011
[INFO: 2021-08-06 03:21:42,415] Epoch: 11, top1_acc = 81.50%, top5_acc = 91.00% in 1011
[INFO: 2021-08-06 03:21:46,052] Cleaning checkpoint: ('./cot_experiments/CoTNet_SYS/snapshot/checkpoint-1.pth.tar', 0.7833827893175074)
[INFO: 2021-08-06 03:21:46,121] Current checkpoints:
('./cot_experiments/CoTNet_SYS/snapshot/checkpoint-4.pth.tar', 0.8150346191889218)
('./cot_experiments/CoTNet_SYS/snapshot/checkpoint-5.pth.tar', 0.8150346191889218)
('./cot_experiments/CoTNet_SYS/snapshot/checkpoint-6.pth.tar', 0.8150346191889218)
('./cot_experiments/CoTNet_SYS/snapshot/checkpoint-7.pth.tar', 0.8150346191889218)
('./cot_experiments/CoTNet_SYS/snapshot/checkpoint-8.pth.tar', 0.8150346191889218)
('./cot_experiments/CoTNet_SYS/snapshot/checkpoint-9.pth.tar', 0.8150346191889218)
('./cot_experiments/CoTNet_SYS/snapshot/checkpoint-10.pth.tar', 0.8150346191889218)
('./cot_experiments/CoTNet_SYS/snapshot/checkpoint-11.pth.tar', 0.8150346191889218)
('./cot_experiments/CoTNet_SYS/snapshot/checkpoint-3.pth.tar', 0.8140454995054401)
('./cot_experiments/CoTNet_SYS/snapshot/checkpoint-2.pth.tar', 0.8031651829871415)
[INFO: 2021-08-06 03:21:57,140] Distributing BatchNorm running means and vars
[INFO: 2021-08-06 03:22:19,407] Epoch: 12, top1_acc = 81.50%, top5_acc = 98.12% in 1011
[INFO: 2021-08-06 03:22:43,320] Epoch: 12, top1_acc = 81.50%, top5_acc = 91.00% in 1011
[INFO: 2021-08-06 03:22:46,589] Cleaning checkpoint: ('./cot_experiments/CoTNet_SYS/snapshot/checkpoint-2.pth.tar', 0.8031651829871415)
[INFO: 2021-08-06 03:22:46,671] Current checkpoints:
('./cot_experiments/CoTNet_SYS/snapshot/checkpoint-4.pth.tar', 0.8150346191889218)
('./cot_experiments/CoTNet_SYS/snapshot/checkpoint-5.pth.tar', 0.8150346191889218)
('./cot_experiments/CoTNet_SYS/snapshot/checkpoint-6.pth.tar', 0.8150346191889218)
('./cot_experiments/CoTNet_SYS/snapshot/checkpoint-7.pth.tar', 0.8150346191889218)
('./cot_experiments/CoTNet_SYS/snapshot/checkpoint-8.pth.tar', 0.8150346191889218)
('./cot_experiments/CoTNet_SYS/snapshot/checkpoint-9.pth.tar', 0.8150346191889218)
('./cot_experiments/CoTNet_SYS/snapshot/checkpoint-10.pth.tar', 0.8150346191889218)
('./cot_experiments/CoTNet_SYS/snapshot/checkpoint-11.pth.tar', 0.8150346191889218)
('./cot_experiments/CoTNet_SYS/snapshot/checkpoint-12.pth.tar', 0.8150346191889218)
('./cot_experiments/CoTNet_SYS/snapshot/checkpoint-3.pth.tar', 0.8140454995054401)
[INFO: 2021-08-06 03:22:58,515] Distributing BatchNorm running means and vars
[INFO: 2021-08-06 03:23:20,010] Epoch: 13, top1_acc = 81.50%, top5_acc = 97.73% in 1011
[INFO: 2021-08-06 03:23:41,497] Epoch: 13, top1_acc = 81.50%, top5_acc = 91.00% in 1011
[INFO: 2021-08-06 03:23:45,675] Cleaning checkpoint: ('./cot_experiments/CoTNet_SYS/snapshot/checkpoint-3.pth.tar', 0.8140454995054401)
[INFO: 2021-08-06 03:23:45,767] Current checkpoints:
('./cot_experiments/CoTNet_SYS/snapshot/checkpoint-4.pth.tar', 0.8150346191889218)
('./cot_experiments/CoTNet_SYS/snapshot/checkpoint-5.pth.tar', 0.8150346191889218)
('./cot_experiments/CoTNet_SYS/snapshot/checkpoint-6.pth.tar', 0.8150346191889218)
('./cot_experiments/CoTNet_SYS/snapshot/checkpoint-7.pth.tar', 0.8150346191889218)
('./cot_experiments/CoTNet_SYS/snapshot/checkpoint-8.pth.tar', 0.8150346191889218)
('./cot_experiments/CoTNet_SYS/snapshot/checkpoint-9.pth.tar', 0.8150346191889218)
('./cot_experiments/CoTNet_SYS/snapshot/checkpoint-10.pth.tar', 0.8150346191889218)
('./cot_experiments/CoTNet_SYS/snapshot/checkpoint-11.pth.tar', 0.8150346191889218)
('./cot_experiments/CoTNet_SYS/snapshot/checkpoint-12.pth.tar', 0.8150346191889218)
('./cot_experiments/CoTNet_SYS/snapshot/checkpoint-13.pth.tar', 0.8150346191889218)
[INFO: 2021-08-06 03:23:56,316] Distributing BatchNorm running means and vars
[INFO: 2021-08-06 03:24:18,093] Epoch: 14, top1_acc = 81.50%, top5_acc = 97.82% in 1011
[INFO: 2021-08-06 03:24:39,590] Epoch: 14, top1_acc = 81.50%, top5_acc = 91.00% in 1011
[INFO: 2021-08-06 03:24:53,347] Distributing BatchNorm running means and vars
[INFO: 2021-08-06 03:25:14,962] Epoch: 15, top1_acc = 81.50%, top5_acc = 98.42% in 1011
[INFO: 2021-08-06 03:25:35,680] Epoch: 15, top1_acc = 81.50%, top5_acc = 91.00% in 1011
[INFO: 2021-08-06 03:25:51,451] Distributing BatchNorm running means and vars
[INFO: 2021-08-06 03:26:13,676] Epoch: 16, top1_acc = 81.50%, top5_acc = 98.22% in 1011
[INFO: 2021-08-06 03:26:35,126] Epoch: 16, top1_acc = 81.50%, top5_acc = 91.00% in 1011
[INFO: 2021-08-06 03:26:50,542] Distributing BatchNorm running means and vars
[INFO: 2021-08-06 03:27:11,586] Epoch: 17, top1_acc = 81.50%, top5_acc = 98.32% in 1011
[INFO: 2021-08-06 03:27:33,794] Epoch: 17, top1_acc = 81.50%, top5_acc = 91.00% in 1011
[INFO: 2021-08-06 03:27:49,459] Distributing BatchNorm running means and vars
[INFO: 2021-08-06 03:28:10,617] Epoch: 18, top1_acc = 81.50%, top5_acc = 98.22% in 1011
[INFO: 2021-08-06 03:28:33,301] Epoch: 18, top1_acc = 81.50%, top5_acc = 91.00% in 1011
[INFO: 2021-08-06 03:28:49,734] Distributing BatchNorm running means and vars
[INFO: 2021-08-06 03:29:13,685] Epoch: 19, top1_acc = 81.50%, top5_acc = 97.92% in 1011
[INFO: 2021-08-06 03:29:37,327] Epoch: 19, top1_acc = 81.50%, top5_acc = 91.00% in 1011
[INFO: 2021-08-06 03:29:51,919] Distributing BatchNorm running means and vars
[INFO: 2021-08-06 03:30:14,903] Epoch: 20, top1_acc = 81.50%, top5_acc = 98.32% in 1011
Also, which torchvision version are you using?
If the dataset is small, performance saturates within a few epochs. My torchvision version is 0.10.0+cu102.
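I adjusted the dataset size; my training results are below. Each epoch the model prints two eval results, and the first is always higher than the second. From train.py, the second eval appears to be related to model_ema. What is model_ema for? The result from the first one (model) is much higher, so why are the model_ema weights the ones being saved?
And at test time, should I load the weights into model or into model_ema?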
[INFO: 2021-08-06 04:41:23,638] Training in distributed mode with multiple processes, 1 GPU per process. Process 0, total 2.
[INFO: 2021-08-06 04:41:29,901] Model cotnet50 created, flops_count: 3.28 GMac, param count: 20.18 M
[INFO: 2021-08-06 04:41:29,991] AMP not enabled. Training in float32.
[INFO: 2021-08-06 04:41:29,991] Using native Torch DistributedDataParallel.
[INFO: 2021-08-06 04:46:03,136] Epoch: 1/350, Iter: 100/1703, loss: 0.8457, lr: [0.0001], time_avg: 2.6040, eta: 17,23:04:16, mem: 14755
[INFO: 2021-08-06 04:50:24,837] Epoch: 1/350, Iter: 200/1703, loss: 0.6839, lr: [0.0001], time_avg: 2.6105, eta: 18,00:04:20, mem: 14755
[INFO: 2021-08-06 04:54:28,935] Epoch: 1/350, Iter: 300/1703, loss: 0.6422, lr: [0.0001], time_avg: 2.5540, eta: 17,14:38:46, mem: 14755
[INFO: 2021-08-06 04:58:42,534] Epoch: 1/350, Iter: 400/1703, loss: 0.6032, lr: [0.0001], time_avg: 2.5495, eta: 17,13:49:49, mem: 14755
[INFO: 2021-08-06 05:03:01,236] Epoch: 1/350, Iter: 500/1703, loss: 0.5870, lr: [0.0001], time_avg: 2.5570, eta: 17,15:00:03, mem: 14755
[INFO: 2021-08-06 05:07:13,442] Epoch: 1/350, Iter: 600/1703, loss: 0.5675, lr: [0.0001], time_avg: 2.5511, eta: 17,13:57:59, mem: 14755
[INFO: 2021-08-06 05:11:21,041] Epoch: 1/350, Iter: 700/1703, loss: 0.5452, lr: [0.0001], time_avg: 2.5404, eta: 17,12:07:08, mem: 14755
[INFO: 2021-08-06 05:15:35,538] Epoch: 1/350, Iter: 800/1703, loss: 0.5387, lr: [0.0001], time_avg: 2.5410, eta: 17,12:08:31, mem: 14755
[INFO: 2021-08-06 05:19:51,240] Epoch: 1/350, Iter: 900/1703, loss: 0.5256, lr: [0.0001], time_avg: 2.5427, eta: 17,12:21:56, mem: 14755
[INFO: 2021-08-06 05:24:06,144] Epoch: 1/350, Iter: 1000/1703, loss: 0.5251, lr: [0.0001], time_avg: 2.5434, eta: 17,12:23:54, mem: 14755
[INFO: 2021-08-06 05:28:18,535] Epoch: 1/350, Iter: 1100/1703, loss: 0.5264, lr: [0.0001], time_avg: 2.5416, eta: 17,12:02:06, mem: 14755
[INFO: 2021-08-06 05:32:28,834] Epoch: 1/350, Iter: 1200/1703, loss: 0.5145, lr: [0.0001], time_avg: 2.5384, eta: 17,11:25:56, mem: 14755
[INFO: 2021-08-06 05:36:39,931] Epoch: 1/350, Iter: 1300/1703, loss: 0.5076, lr: [0.0001], time_avg: 2.5363, eta: 17,11:00:46, mem: 14755
[INFO: 2021-08-06 05:40:49,729] Epoch: 1/350, Iter: 1400/1703, loss: 0.4995, lr: [0.0001], time_avg: 2.5335, eta: 17,10:29:25, mem: 14755
[INFO: 2021-08-06 05:45:05,531] Epoch: 1/350, Iter: 1500/1703, loss: 0.5066, lr: [0.0001], time_avg: 2.5352, eta: 17,10:41:21, mem: 14755
[INFO: 2021-08-06 05:49:14,644] Epoch: 1/350, Iter: 1600/1703, loss: 0.4886, lr: [0.0001], time_avg: 2.5324, eta: 17,10:09:25, mem: 14755
[INFO: 2021-08-06 05:53:16,233] Epoch: 1/350, Iter: 1700/1703, loss: 0.4862, lr: [0.0001], time_avg: 2.5255, eta: 17,08:57:18, mem: 14755
[INFO: 2021-08-06 05:53:17,954] Distributing BatchNorm running means and vars
[INFO: 2021-08-06 05:57:09,450] Epoch: 1, top1_acc = 90.67%, top5_acc = 97.98% in 31585
[INFO: 2021-08-06 06:00:59,118] Epoch: 1, top1_acc = 60.13%, top5_acc = 77.00% in 31585
[INFO: 2021-08-06 06:01:02,962] Current checkpoints:
('./cot_experiments/CoTNet_SYS/snapshot/checkpoint-1.pth.tar', 0.601329745132183)
[INFO: 2021-08-06 06:05:41,443] Epoch: 2/350, Iter: 100/1703, loss: 0.8012, lr: [0.050080000000000006], time_avg: 2.5366, eta: 17,10:43:12, mem: 14755
[INFO: 2021-08-06 06:10:00,337] Epoch: 2/350, Iter: 200/1703, loss: 0.3913, lr: [0.050080000000000006], time_avg: 2.5394, eta: 17,11:06:10, mem: 14755
[INFO: 2021-08-06 06:14:05,733] Epoch: 2/350, Iter: 300/1703, loss: 0.3607, lr: [0.050080000000000006], time_avg: 2.5351, eta: 17,10:19:41, mem: 14755
[INFO: 2021-08-06 06:18:21,840] Epoch: 2/350, Iter: 400/1703, loss: 0.3303, lr: [0.050080000000000006], time_avg: 2.5364, eta: 17,10:27:39, mem: 14755
[INFO: 2021-08-06 06:22:45,340] Epoch: 2/350, Iter: 500/1703, loss: 0.3194, lr: [0.050080000000000006], time_avg: 2.5408, eta: 17,11:07:43, mem: 14755
[INFO: 2021-08-06 06:26:58,533] Epoch: 2/350, Iter: 600/1703, loss: 0.3030, lr: [0.050080000000000006], time_avg: 2.5404, eta: 17,10:59:39, mem: 14755
[INFO: 2021-08-06 06:31:16,240] Epoch: 2/350, Iter: 700/1703, loss: 0.3048, lr: [0.050080000000000006], time_avg: 2.5420, eta: 17,11:10:29, mem: 14755
[INFO: 2021-08-06 06:35:24,422] Epoch: 2/350, Iter: 800/1703, loss: 0.2982, lr: [0.050080000000000006], time_avg: 2.5396, eta: 17,10:42:27, mem: 14755
[INFO: 2021-08-06 06:39:44,430] Epoch: 2/350, Iter: 900/1703, loss: 0.2859, lr: [0.050080000000000006], time_avg: 2.5419, eta: 17,11:01:12, mem: 14755
[INFO: 2021-08-06 06:44:03,033] Epoch: 2/350, Iter: 1000/1703, loss: 0.2884, lr: [0.050080000000000006], time_avg: 2.5435, eta: 17,11:13:06, mem: 14755
[INFO: 2021-08-06 06:48:14,835] Epoch: 2/350, Iter: 1100/1703, loss: 0.2700, lr: [0.050080000000000006], time_avg: 2.5426, eta: 17,10:59:52, mem: 14755
[INFO: 2021-08-06 06:52:31,536] Epoch: 2/350, Iter: 1200/1703, loss: 0.2724, lr: [0.050080000000000006], time_avg: 2.5434, eta: 17,11:03:55, mem: 14755
[INFO: 2021-08-06 06:56:50,044] Epoch: 2/350, Iter: 1300/1703, loss: 0.2617, lr: [0.050080000000000006], time_avg: 2.5448, eta: 17,11:13:23, mem: 14755
[INFO: 2021-08-06 07:00:55,936] Epoch: 2/350, Iter: 1400/1703, loss: 0.2695, lr: [0.050080000000000006], time_avg: 2.5421, eta: 17,10:41:46, mem: 14755
[INFO: 2021-08-06 07:05:29,248] Epoch: 2/350, Iter: 1500/1703, loss: 0.2622, lr: [0.050080000000000006], time_avg: 2.5480, eta: 17,11:36:27, mem: 14755
[INFO: 2021-08-06 07:09:44,442] Epoch: 2/350, Iter: 1600/1703, loss: 0.2492, lr: [0.050080000000000006], time_avg: 2.5481, eta: 17,11:33:22, mem: 14755
[INFO: 2021-08-06 07:13:50,103] Epoch: 2/350, Iter: 1700/1703, loss: 0.2609, lr: [0.050080000000000006], time_avg: 2.5455, eta: 17,11:02:33, mem: 14755
[INFO: 2021-08-06 07:13:51,927] Distributing BatchNorm running means and vars
[INFO: 2021-08-06 07:17:45,207] Epoch: 2, top1_acc = 96.48%, top5_acc = 99.47% in 31585
[INFO: 2021-08-06 07:21:37,988] Epoch: 2, top1_acc = 60.10%, top5_acc = 96.47% in 31585
[INFO: 2021-08-06 07:21:42,041] Current checkpoints:
('./cot_experiments/CoTNet_SYS/snapshot/checkpoint-1.pth.tar', 0.601329745132183)
('./cot_experiments/CoTNet_SYS/snapshot/checkpoint-2.pth.tar', 0.6009814785499445)
[INFO: 2021-08-06 07:26:31,034] Epoch: 3/350, Iter: 100/1703, loss: 0.2728, lr: [0.10006000000000001], time_avg: 2.5536, eta: 17,12:18:49, mem: 14755
[INFO: 2021-08-06 07:30:41,838] Epoch: 3/350, Iter: 200/1703, loss: 0.2638, lr: [0.10006000000000001], time_avg: 2.5524, eta: 17,12:02:04, mem: 14755
[INFO: 2021-08-06 07:34:58,437] Epoch: 3/350, Iter: 300/1703, loss: 0.2675, lr: [0.10006000000000001], time_avg: 2.5527, eta: 17,12:01:26, mem: 14755
[INFO: 2021-08-06 07:39:23,032] Epoch: 3/350, Iter: 400/1703, loss: 0.2529, lr: [0.10006000000000001], time_avg: 2.5552, eta: 17,12:21:21, mem: 14755
[INFO: 2021-08-06 07:43:29,952] Epoch: 3/350, Iter: 500/1703, loss: 0.2500, lr: [0.10006000000000001], time_avg: 2.5530, eta: 17,11:55:22, mem: 14755
[INFO: 2021-08-06 07:47:47,142] Epoch: 3/350, Iter: 600/1703, loss: 0.2475, lr: [0.10006000000000001], time_avg: 2.5534, eta: 17,11:55:46, mem: 14755
[INFO: 2021-08-06 07:51:59,731] Epoch: 3/350, Iter: 700/1703, loss: 0.2466, lr: [0.10006000000000001], time_avg: 2.5528, eta: 17,11:44:53, mem: 14755
[INFO: 2021-08-06 07:56:08,233] Epoch: 3/350, Iter: 800/1703, loss: 0.2354, lr: [0.10006000000000001], time_avg: 2.5512, eta: 17,11:24:44, mem: 14755
[INFO: 2021-08-06 08:00:28,146] Epoch: 3/350, Iter: 900/1703, loss: 0.2421, lr: [0.10006000000000001], time_avg: 2.5523, eta: 17,11:31:28, mem: 14755
[INFO: 2021-08-06 08:05:00,644] Epoch: 3/350, Iter: 1000/1703, loss: 0.2379, lr: [0.10006000000000001], time_avg: 2.5562, eta: 17,12:05:52, mem: 14755
[INFO: 2021-08-06 08:09:24,947] Epoch: 3/350, Iter: 1100/1703, loss: 0.2291, lr: [0.10006000000000001], time_avg: 2.5581, eta: 17,12:20:35, mem: 14755
[INFO: 2021-08-06 08:13:38,432] Epoch: 3/350, Iter: 1200/1703, loss: 0.2312, lr: [0.10006000000000001], time_avg: 2.5576, eta: 17,12:11:21, mem: 14755
[INFO: 2021-08-06 08:17:58,523] Epoch: 3/350, Iter: 1300/1703, loss: 0.2252, lr: [0.10006000000000001], time_avg: 2.5585, eta: 17,12:16:08, mem: 14755
[INFO: 2021-08-06 08:22:19,845] Epoch: 3/350, Iter: 1400/1703, loss: 0.2264, lr: [0.10006000000000001], time_avg: 2.5597, eta: 17,12:23:05, mem: 14755
[INFO: 2021-08-06 08:26:32,135] Epoch: 3/350, Iter: 1500/1703, loss: 0.2221, lr: [0.10006000000000001], time_avg: 2.5589, eta: 17,12:11:26, mem: 14755
[INFO: 2021-08-06 08:30:50,429] Epoch: 3/350, Iter: 1600/1703, loss: 0.2145, lr: [0.10006000000000001], time_avg: 2.5594, eta: 17,12:11:53, mem: 14755
[INFO: 2021-08-06 08:34:52,335] Epoch: 3/350, Iter: 1700/1703, loss: 0.2247, lr: [0.10006000000000001], time_avg: 2.5566, eta: 17,11:40:33, mem: 14755
[INFO: 2021-08-06 08:34:56,030] Distributing BatchNorm running means and vars
[INFO: 2021-08-06 08:38:43,637] Epoch: 3, top1_acc = 95.90%, top5_acc = 99.60% in 31585
[INFO: 2021-08-06 08:42:29,478] Epoch: 3, top1_acc = 60.10%, top5_acc = 97.70% in 31585
[INFO: 2021-08-06 08:42:33,025] Current checkpoints:
('./cot_experiments/CoTNet_SYS/snapshot/checkpoint-1.pth.tar', 0.601329745132183)
('./cot_experiments/CoTNet_SYS/snapshot/checkpoint-2.pth.tar', 0.6009814785499445)
('./cot_experiments/CoTNet_SYS/snapshot/checkpoint-3.pth.tar', 0.6009814785499445)
Exponential moving average (EMA) is mainly used to stabilize training. model_ema usually achieves better results after about 30 epochs. In your case, you need to modify the model_ema_decay parameter to make EMA work, since your model converges in just a few epochs.
You can set model_ema_decay to a smaller value, e.g., 0.99 or 0.999. I think you can also train the model without EMA, since your model has already achieved high top-1 accuracy.
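To illustrate why a smaller model_ema_decay helps here, the following is a minimal sketch rather than the repository's actual EMA code. It assumes the EMA is updated once per training iteration, uses a single scalar as a stand-in for the model weights, and takes 0.9999 as a hypothetical "large" decay for comparison. The 1703 iterations per epoch come from the log above; the function names are made up for this example.

```python
def ema_update(ema_value, new_value, decay):
    """One EMA step: keep `decay` of the old average, blend in the rest."""
    return decay * ema_value + (1.0 - decay) * new_value


def initial_weight_share(decay, num_updates):
    """Fraction of the EMA that still comes from its initial value."""
    return decay ** num_updates


def simulate(decay, num_updates, start=0.0, target=1.0):
    """Track a constant `target` starting from `start` (toy stand-in for weights)."""
    ema = start
    for _ in range(num_updates):
        ema = ema_update(ema, target, decay)
    return ema


if __name__ == "__main__":
    # 3 epochs x 1703 iterations, as in the log above (assumes one update per iteration).
    updates = 3 * 1703
    # 0.9999 is a hypothetical large decay; 0.999 and 0.99 are the values suggested above.
    for decay in (0.9999, 0.999, 0.99):
        share = initial_weight_share(decay, updates)
        tracked = simulate(decay, updates)
        print(f"decay={decay}: initial-weight share={share:.4f}, tracked value={tracked:.4f}")
```

Under these assumptions, a decay very close to 1 leaves a large share of the EMA tied to the initial (untrained) weights after only three epochs, which would be consistent with the model_ema eval lagging far behind the model eval. With 0.99 or 0.999 the average catches up almost completely within the same number of updates.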