sming256/OpenTAD

Challenges in replicating AdaTAD


Hi!

I found that when I reproduce AdaTAD on the THUMOS dataset, the mAP I get is consistently much lower than the one reported in the paper. I describe my training process and parameters below.

I ran the command:
torchrun --nnodes=1 --nproc_per_node=1 --rdzv_backend=c10d --rdzv_endpoint=localhost:0 tools/train.py configs/adatad/thumos/e2e_thumos_videomae_s_768x1_160_adapter.py

After completing 60 epochs of training, the result was:

2024-09-28 16:37:35 Train INFO: Evaluation starts...
2024-09-28 16:37:53 Train INFO: Loaded annotations from validation subset.
2024-09-28 16:37:53 Train INFO: Number of ground truth instances: 3325
2024-09-28 16:37:53 Train INFO: Number of predictions: 422000
2024-09-28 16:37:53 Train INFO: Fixed threshold for tiou score: [0.3, 0.4, 0.5, 0.6, 0.7]
2024-09-28 16:37:53 Train INFO: Average-mAP: 60.30 (%)
2024-09-28 16:37:53 Train INFO: mAP at tIoU 0.30 is 79.42%
2024-09-28 16:37:53 Train INFO: mAP at tIoU 0.40 is 72.74%
2024-09-28 16:37:53 Train INFO: mAP at tIoU 0.50 is 62.65%
2024-09-28 16:37:53 Train INFO: mAP at tIoU 0.60 is 50.92%
2024-09-28 16:37:53 Train INFO: mAP at tIoU 0.70 is 35.77%
2024-09-28 16:37:53 Train INFO: Training Over...

The average mAP of 60.30 is significantly lower than the 69.03 reported in the paper. Note that I made some changes to the parameters in the configuration file; I list all of them below:

  1. I changed train=dict(batch_size=2, num_workers=2), val=dict(batch_size=2, num_workers=2) to train=dict(batch_size=8, num_workers=11), val=dict(batch_size=2, num_workers=3). I found that this maximizes the utilization of my machine while keeping training stable.

  2. I increased the learning rate of the backbone in the optimizer from 1e-4 to 2e-4 and the learning rate of the adapter from 2e-4 to 4e-4. Since I can only train on one GPU and my total batch size is 8, while the provided configuration runs on two GPUs with a total batch size of 4, I doubled the learning rates according to the linear scaling rule (a sketch of changes 1 and 2 follows after this list).

  3. I changed the parameters that control how training progress is printed in the workflow, but end_epoch remains 60, so this does not affect the final training result.
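
For concreteness, here is a minimal sketch of the overrides from points 1 and 2, written in the Python-dict style the OpenTAD configs use. The top-level key names (solver, optimizer) and the exact nesting are my assumptions about how the released config is organized; only the values come from my changes.

```python
# Sketch of the overrides from points (1) and (2) above.
# Key names are assumptions; the values are exactly the ones I listed.
solver = dict(
    train=dict(batch_size=8, num_workers=11),  # released config: batch_size=2, num_workers=2
    val=dict(batch_size=2, num_workers=3),     # released config: batch_size=2, num_workers=2
)

# Linear scaling rule: lr is multiplied by (new total batch) / (old total batch) = 8 / 4 = 2
optimizer = dict(
    backbone=dict(lr=2e-4),  # released config: 1e-4
    adapter=dict(lr=4e-4),   # released config: 2e-4
)
```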

All other training parameters remain consistent with the original ones. The complete training log is provided here.

I wonder whether my hyperparameter settings are causing the lower final performance. Could you please give me some suggestions?

thanks

First, adjusting the batch size significantly affects detection performance on the THUMOS dataset. Although a larger batch size gives better GPU utilization, the detector (i.e., ActionFormer) prefers a smaller batch size on this dataset.

Second, the default setting uses a total batch size of 2, i.e., 1 sample per GPU when using 2 GPUs. I suggest training the model with the released config as-is; the result should be close to 69%.
You can run the same config on either 1 or 2 GPUs, keeping the total batch size at 2.
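
To make the batch-size bookkeeping explicit, here is a small sketch of the arithmetic (not code from the repo):

```python
# Effective (total) batch size = per-GPU batch_size * number of GPU processes.
def total_batch_size(per_gpu_batch_size: int, nproc_per_node: int) -> int:
    return per_gpu_batch_size * nproc_per_node

print(total_batch_size(2, 1))  # 1 GPU,  batch_size=2 per GPU -> total 2 (recommended)
print(total_batch_size(1, 2))  # 2 GPUs, batch_size=1 per GPU -> total 2 (recommended)
print(total_batch_size(8, 1))  # 1 GPU,  batch_size=8 per GPU -> total 8 (too large here)
```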

Thank you for your suggestion; it really worked.