Can't reproduce Diving48 training

Question

ahnGeo opened this issue 2 years ago · 1 comment

Hello!

I am trying to reproduce your model training on Diving48, but have not succeeded. I used your Diving48 config file, vitclip_base_diving48.py, in two variants: (1) the original version and (2) clip_len = 8, frame_interval = 8.
I launched training with the command bash tools/dist_train.sh <PATH/TO/CONFIG> <NUM_GPU> --test-best --validate --cfg-options work_dir=<PATH/TO/OUTPUT>.

I am not sure what the problem is. Please let me know if you have any ideas. Thank you.

Environment info:
Python 3.9.13, PyTorch 1.10.0, CUDA 11.3
Here is part of the log:
2023-03-08 21:00:08,793 - mmaction - INFO - Epoch [50][540/627] lr: 2.960e-07, eta: 0:01:07, time: 0.698, data_time: 0.000, memory: 19879, top1_acc: 0.3833, top5_acc: 0.8187, loss_cls: 1.8822, loss: 1.8822
2023-03-08 21:00:22,859 - mmaction - INFO - Epoch [50][560/627] lr: 2.960e-07, eta: 0:00:51, time: 0.703, data_time: 0.006, memory: 19879, top1_acc: 0.3604, top5_acc: 0.8083, loss_cls: 1.9930, loss: 1.9930
2023-03-08 21:00:36,925 - mmaction - INFO - Epoch [50][580/627] lr: 2.960e-07, eta: 0:00:36, time: 0.703, data_time: 0.006, memory: 19879, top1_acc: 0.3688, top5_acc: 0.8250, loss_cls: 1.9158, loss: 1.9158
2023-03-08 21:00:50,874 - mmaction - INFO - Epoch [50][600/627] lr: 2.960e-07, eta: 0:00:20, time: 0.697, data_time: 0.000, memory: 19879, top1_acc: 0.4146, top5_acc: 0.8438, loss_cls: 1.8412, loss: 1.8412
2023-03-08 21:01:04,933 - mmaction - INFO - Epoch [50][620/627] lr: 2.960e-07, eta: 0:00:05, time: 0.703, data_time: 0.006, memory: 19879, top1_acc: 0.3896, top5_acc: 0.8125, loss_cls: 1.9206, loss: 1.9206
2023-03-08 21:01:10,258 - mmaction - INFO - Saving checkpoint at 50 epochs
2023-03-08 21:03:12,731 - mmaction - INFO - Evaluating top_k_accuracy ...
2023-03-08 21:03:12,740 - mmaction - INFO -
top1_acc 0.1025
top5_acc 0.3548
2023-03-08 21:03:12,740 - mmaction - INFO - Evaluating mean_class_accuracy ...
2023-03-08 21:03:12,741 - mmaction - INFO -
mean_acc 0.0586
2023-03-08 21:03:12,799 - mmaction - INFO - The previous best checkpoint /data/aim/outputs/diving48/best_top1_acc_epoch_45.pth was removed
2023-03-08 21:03:14,531 - mmaction - INFO - Now best checkpoint is saved as best_top1_acc_epoch_50.pth.
2023-03-08 21:03:14,531 - mmaction - INFO - Best top1_acc is 0.1025 at 50 epoch.
2023-03-08 21:03:14,532 - mmaction - INFO - Epoch(val) [50][985] top1_acc: 0.1025, top5_acc: 0.3548, mean_class_accuracy: 0.0586
2023-03-08 21:03:15,535 - mmaction - INFO - Warning: test_best set as True, but is not applicable (eval_hook.best_ckpt_path is None)

Answer 1 · 2023-03-19T16:44:53.000Z

Hi @ahnGeo, thanks for your interest in our work. The results in our paper are based on 32 frames. I am not sure what performance to expect with 8 frames, but the accuracy in your case does seem a little low.
1. Please make sure you changed every clip_len to 8, including train/val/test and the model's num_frames.
2. Your batch size seems small. Our default setting is 8 GPUs with a batch size of 64. If you use a different number of GPUs or a different batch size, you may need to tune the learning rate to get the best performance.
Hope it helps.
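
The two suggestions above (keeping clip_len and num_frames consistent, and retuning the learning rate for a different batch size) can be sketched as a pair of small checks. This is only a rough illustration, not the repository's own tooling. The dict layout (`data.<split>.pipeline`, a `SampleFrames` step, `model.backbone.num_frames`) is an assumption modeled on typical mmaction-style configs and may not match vitclip_base_diving48.py exactly. The linear scaling rule is a common heuristic for a starting point, not something the answer prescribes, and the base learning rate of 3e-4 is a hypothetical placeholder. The batch size of 64 is read here as the total across all GPUs, which is also an assumption.

```python
def collect_clip_lens(cfg):
    """Return the clip_len of the SampleFrames step for each split.

    Assumes an mmaction-style layout: cfg["data"][split]["pipeline"] is a
    list of dicts, one of which has type "SampleFrames" (an assumption).
    """
    found = {}
    for split in ("train", "val", "test"):
        for step in cfg["data"][split]["pipeline"]:
            if step.get("type") == "SampleFrames":
                found[split] = step["clip_len"]
    return found


def check_frame_consistency(cfg):
    """List every split whose clip_len differs from the model's num_frames."""
    # The location of num_frames under model.backbone is an assumption.
    num_frames = cfg["model"]["backbone"]["num_frames"]
    problems = []
    for split, clip_len in collect_clip_lens(cfg).items():
        if clip_len != num_frames:
            problems.append(
                f"{split}: clip_len={clip_len} but num_frames={num_frames}"
            )
    return problems


def scale_lr(base_lr, base_total_batch, num_gpus, videos_per_gpu):
    """Linear scaling heuristic: lr proportional to the total batch size."""
    total_batch = num_gpus * videos_per_gpu
    return base_lr * total_batch / base_total_batch


# Toy config: the pipelines were switched to 8 frames, the model was not.
toy_cfg = {
    "model": {"backbone": {"num_frames": 32}},
    "data": {
        split: {
            "pipeline": [
                {"type": "SampleFrames", "clip_len": 8, "frame_interval": 8}
            ]
        }
        for split in ("train", "val", "test")
    },
}

for problem in check_frame_consistency(toy_cfg):
    print(problem)

# Hypothetical base lr of 3e-4 at total batch 64, rerun on 4 GPUs x 8 videos.
print(scale_lr(base_lr=3e-4, base_total_batch=64, num_gpus=4, videos_per_gpu=8))
```

An empty list from `check_frame_consistency` means every pipeline agrees with the model. The value returned by `scale_lr` is only a starting point and may still need tuning, as the answer notes.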