MCG-NJU/MeMOTR

Performance Reproduction

mattiasegu opened this issue · 11 comments

Hi, thank you very much for providing this well-structured codebase!

I tried training MeMOTR (with DAB-DETR) on DanceTrack and ran into performance issues. In particular, using the provided config file and pretrained checkpoint I only obtain:

HOTA DetA AssA
62.481 74.141 52.901

In particular, the association accuracy lags more than 2 points behind the performance reported in the paper. Has anyone been able to reproduce the original performance? Is there anything I'm missing? @HELLORPG, have you tried training this model with the current codebase and config file? Thanks in advance for your help!
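A quick note for readers less familiar with the metric: per localization threshold, HOTA is the geometric mean of detection and association accuracy (HOTA_α = sqrt(DetA_α · AssA_α)). The published scores are averaged over thresholds, so the relation only holds approximately, but it shows why a >2-point AssA deficit translates almost directly into a HOTA gap:

```python
# Rough sanity check of how AssA drives HOTA. Per threshold, HOTA = sqrt(DetA * AssA);
# the reported numbers are averaged over thresholds, so this is only approximate.
deta, assa = 74.141, 52.901
approx_hota = (deta * assa) ** 0.5
print(f"approx HOTA = {approx_hota:.2f}")  # ~62.6, close to the measured 62.481
```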

May I ask how many GPUs you have used for training? 8 GPUs?

Yes, 8 NVIDIA RTX 4090 GPUs. I use the --use-checkpoint flag and the default learning rate provided in the config file. I'm now trying with Deformable DETR to see whether the issue I have is specific to DAB-DETR.

I did re-run our code before open-sourcing it. However, I did not evaluate it on the val set but directly on the test set, and it achieved the desired results.
Could you please submit your result to the CodaLab server so we can see its performance on the test set?
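For anyone preparing a test-set submission: the server expects a single zip of per-sequence MOT-format result files. The helper below is only a sketch of that packaging step; the archive layout and file names are assumptions, so please check the DanceTrack/CodaLab submission guidelines for the exact structure.

```python
# Hedged sketch: bundle per-sequence tracker outputs (e.g. dancetrack0003.txt) into a
# zip for the CodaLab test server. The expected archive layout is an assumption here;
# verify it against the DanceTrack submission guidelines before uploading.
import zipfile
from pathlib import Path


def make_submission(result_dir: str, out_path: str = "submission.zip") -> None:
    with zipfile.ZipFile(out_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for txt in sorted(Path(result_dir).glob("*.txt")):
            zf.write(txt, arcname=txt.name)  # store files at the archive root


# Example with a hypothetical output directory:
# make_submission("outputs/dancetrack_test/tracker")
```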
As I discussed here:

# We change some parameters compared with our paper, looking forward more stable training convergence.

I tried my best to make the work more consistent. In my experience and that of others, convergence on DanceTrack can easily become unstable. However, with our codebase it should be possible to keep the swing below 1.0 HOTA on DanceTrack (in my reproductions the swings were only ~0.5 HOTA).

Some other evidence of inconsistent results on DanceTrack is as follows:

  • OC-SORT also reports ~0.5 HOTA instability, and the situation is more serious for end-to-end models than for heuristic algorithms.
  • An issue of MOTRv2 also reported this instability.

I found that the results on SportsMOT are more stable.

BTW, I suggest re-running the DAB-D-DETR version again to see what happens. If it comes down to luck, I don't believe we will have such bad luck twice in a row.
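To separate "luck" from genuine instability, it can also help to pin the random seeds between repeats. The snippet below is just the standard PyTorch seeding pattern, not MeMOTR-specific code, and data-loader workers plus nondeterministic CUDA kernels can still leave some residual run-to-run variance:

```python
# Standard PyTorch seeding pattern (not MeMOTR-specific). Even with this, DDP,
# dataloader workers, and nondeterministic CUDA kernels can leave some variance,
# but it helps distinguish bad luck from real training instability.
import os
import random

import numpy as np
import torch


def seed_everything(seed: int = 42) -> None:
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    os.environ["PYTHONHASHSEED"] = str(seed)
    # Optional: more deterministic (and usually slower) cuDNN behavior.
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
```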

One more thing: have you used --use-checkpoint during your training? I once tried to run this experiment on a 3090 (24 GB), but when processing 5 frames the CUDA memory was insufficient.

Forget about it. I missed it in your reply. My fault.
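For anyone else hitting the 24 GB limit: --use-checkpoint enables gradient (activation) checkpointing, which recomputes intermediate activations during the backward pass instead of storing them. The sketch below is a generic illustration of that pattern, not MeMOTR's actual code:

```python
# Generic illustration of what a --use-checkpoint style flag typically enables
# (not MeMOTR's actual code): trade extra forward compute for lower activation memory.
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint


class TinyEncoder(nn.Module):
    def __init__(self, dim: int = 256, num_layers: int = 6, use_checkpoint: bool = False):
        super().__init__()
        self.use_checkpoint = use_checkpoint  # mirrors the command-line flag
        self.layers = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
            for _ in range(num_layers)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for layer in self.layers:
            if self.use_checkpoint and self.training:
                # Activations inside `layer` are recomputed during backward instead of
                # being stored, which is how longer clips can fit in limited GPU memory.
                x = checkpoint(layer, x, use_reentrant=False)
            else:
                x = layer(x)
        return x


if __name__ == "__main__":
    enc = TinyEncoder(use_checkpoint=True).train()
    out = enc(torch.randn(2, 100, 256, requires_grad=True))
    out.sum().backward()
```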

Thank you very much for your detailed replies and for your efforts! I will keep this issue updated as soon as I:

  • re-run the DAB-DETR training
  • run the Deformable DETR training
  • validate the models on the DanceTrack test set
  • train on other datasets (e.g. SportsMOT)

My pleasure. Keep in touch~

If anyone else has tried to reproduce the experiments on DanceTrack, you can post your results here to give us more evidence.

Update: with MeMOTR (Deformable DETR) I get performance reasonably close to that reported in the paper on DanceTrack val:

        HOTA   DetA   AssA
Paper   61.0   71.2   52.5
My Run  60.8   72.1   51.5

As you mentioned, there is quite some variance from one epoch to the next, but this result seems more satisfying than the one obtained with DAB-DETR :)

Nonetheless, it seems that the remaining gap lies in the association accuracy.

Thank you very much for your result. I believe it will help those who reproduce the experiments later to get more information beyond our paper.

In the past few days, I have used the code from this repository to conduct reproduction experiments. Due to limited GPU resources (you know, a lot of experiments for other work), I only reproduced MeMOTR with DAB-D-DETR.

Here is my result:

      HOTA   DetA   AssA
val   63.7   74.5   54.6
test  67.7   80.1   57.4

and my log.txt is here.

In my experience, this is an acceptable result on DanceTrack. However, to be honest, this instability is really frustrating. According to my previous exploration, careful adjustment of training strategies is required to alleviate this issue. But I don't have enough GPUs to repeat a large number of experiments (to verify training stability, you need to run a given setting at least 3~4 times). If you or anyone else has any ideas or results about it, please feel free to discuss them with me. I'm also trying to alleviate this problem in the extended version.
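On the "run each setting 3~4 times" point, a tiny helper like the one below (the numbers in the example are made up) makes it easy to report the mean and swing of val HOTA across repeats, so different reproductions can be compared on the same footing:

```python
# Hypothetical helper for summarizing repeated runs of one training setting.
# Pass in the final val HOTA of each repeat; it reports mean, std, and swing.
from statistics import mean, pstdev


def summarize_runs(hotas: list[float]) -> str:
    swing = max(hotas) - min(hotas)
    return (f"runs={len(hotas)}  mean={mean(hotas):.2f}  "
            f"std={pstdev(hotas):.2f}  swing={swing:.2f} HOTA")


# Example with made-up numbers; replace with your own repeats:
print(summarize_runs([63.7, 63.2, 62.9, 63.5]))
```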