VISION-SJTU/USOT

I found it difficult to reproduce the results

Closed this issue · 5 comments

My environment is PyTorch 1.8.1, CUDA 11.1, Python 3.8.0, with a batch size of 12.
I cannot reproduce the results when training on the full data (VID, GOT10k, LASOT, YTVOS).
(screenshot of reproduced results)
The A / R / EAO I reproduce are 0.574 / 0.393 / 0.300.
Can the result be reproduced using PyTorch 1.7?

We follow TracKit and sequentially test all checkpoints from epoch 10 to epoch 30, then select the one with the best result on VOT2018. The best result does not always appear in the last epochs. You may use the script "onekey.py" to test all available checkpoints. Besides, VOT2018 is actually not a stable benchmark. In Table 7 we mention that VOT2018 is very sensitive to hyper-parameters, and minor revisions to them will affect the result noticeably. Results on other benchmarks such as LaSOT and TrackingNet are usually more stable.
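In case it helps, here is a minimal sketch of that "test every checkpoint, keep the best" loop. The checkpoint file pattern and the `evaluate_vot2018` function are hypothetical placeholders for your own evaluation pipeline (e.g. whatever onekey.py drives), not code from this repo.

```python
# Minimal sketch: evaluate every checkpoint from epoch 10 to 30 and keep the best.
# The file pattern and evaluate_vot2018() are placeholders for your own setup.
import glob
import os
import re

def evaluate_vot2018(checkpoint_path):
    """Run the tracker with this checkpoint on VOT2018 and return its EAO."""
    raise NotImplementedError("plug in your evaluation pipeline here")

results = {}
for ckpt in sorted(glob.glob("snapshot/checkpoint_e*.pth")):
    match = re.search(r"e(\d+)\.pth$", os.path.basename(ckpt))
    if match and 10 <= int(match.group(1)) <= 30:
        results[int(match.group(1))] = evaluate_vot2018(ckpt)

best_epoch = max(results, key=results.get)
print(f"Best epoch: {best_epoch}, EAO = {results[best_epoch]:.3f}")
```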

Nevertheless, EAO = 0.300 is still too low. From my experience, obtaining a checkpoint with EAO around 0.330 on VOT2018 without tuning hyper-parameters for testing (memory queue, online weight, etc.) is not difficult. You may need to check whether there are problems with your experiment settings.

I think the most likely reason is that you did not test all checkpoints from epoch 10 to epoch 30. The following screenshot shows the last experiment I conducted before releasing the code. Although VOT2018 is not stable, reproducing an EAO around 0.33 even without tuning hyper-parameters for testing (as in Table 7) is definitely possible. Carefully tuning hyper-parameters may further boost the performance.

(screenshot of per-checkpoint results on VOT2018)

One phenomenon we notice is that the results of all checkpoints on VOT2018 have a large variance. This can be attributed to two major reasons: 1. VOT2018 is not stable enough; 2. USOT is not stable enough. The latter is what we are currently working hard on for USOT+, namely making the framework more stable and better-performing.

By the way, my recommendation is: don't be obsessed with VOT2018. The best results we obtain on OTB2015 and LaSOT are around 60.0 and 37.0 (AUC), but those checkpoints perform worse on VOT2018, w(゚Д゚)w. So do not waste time chasing 1-2 points on a particular benchmark. A better training pipeline and preprocessing methodology play a much more important role in unsupervised tracking. There is still a lot of room for improvement.

Thanks for your perfect answer. I am new to unsupervised SOT and not familiar with SOT settings (I was doing MOT before). I only tested the last epoch's checkpoint, and that may be the biggest problem. When testing on the LaSOT dataset, is the model also trained on all the datasets?

Another question: at inference time, should I carefully fine-tune the hyper-parameters?

Q1: When testing on the LaSOT dataset, is the model also trained on all the datasets?
Yes. The model used for reporting results is consistent across all benchmarks, with the same model weights and hyper-parameters. That is why I do not report the better results mentioned above on LaSOT and OTB2015: I have to make sure the inference model and settings are exactly the same for all benchmarks.

Q2: Should I carefully fine-tune hyper-parameters?
The original TracKit framework uses a tool called "ray" to automatically tune hyper-parameters for testing. I think this design is not elegant, so the tuning phase was omitted in my experiments. The only tuning I performed was for Table 7, where I did search for the best hyper-parameters (N_q and w) for that particular checkpoint on VOT2018. However, that is actually one of the reasons why the performance is not stable enough. I do not recommend tuning the hyper-parameters, since it is a waste of time (working hard for only 1-2 points). A checkpoint with accuracy around 0.577 and EAO around 0.330 is good enough, already significantly outperforming all previous unsupervised trackers.
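For reference only, the kind of grid search mentioned above over N_q (memory queue size) and w (online weight) can be as simple as the sketch below. The `eval_eao` function, the checkpoint path, and the candidate value grids are hypothetical placeholders, not the settings used in the paper.

```python
# Minimal sketch of a grid search over the two testing hyper-parameters.
# eval_eao(), the checkpoint path, and the candidate values are placeholders.
import itertools

def eval_eao(checkpoint_path, n_q, w):
    """Evaluate one (memory queue size, online weight) pair on VOT2018; return EAO."""
    raise NotImplementedError("plug in your evaluation pipeline here")

checkpoint = "snapshot/checkpoint_e22.pth"   # whichever epoch looked best beforehand
n_q_grid = [4, 6, 8]                         # example memory queue sizes
w_grid = [0.5, 0.6, 0.7]                     # example online weights

best_n_q, best_w, best_eao = max(
    ((n_q, w, eval_eao(checkpoint, n_q, w))
     for n_q, w in itertools.product(n_q_grid, w_grid)),
    key=lambda t: t[2],
)
print(f"Best N_q={best_n_q}, w={best_w}, EAO={best_eao:.3f}")
```

As noted above, though, this kind of search is exactly what makes the reported VOT2018 numbers less stable, so it is better skipped.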