Why can your model's performance exceed that of the real data?
Silverster98 opened this issue · 7 comments
Why can your model's performance exceed that of the real data (by about 4% in R-Precision)? By the way, I noticed that you don't report MM-Dist as other works do. What is your model's MM-Dist score?
I think the random seed is the reason. When I was testing my MotionLCM, I found that the scores on the real data can exceed the officially reported results by a small margin. The numbers for the real data actually fluctuate.
Another important reason is that the evaluator is too weak. Hahaha! The community needs new and more robust metrics.
I'm sorry that I accidentally overlooked your question for so long due to being busy with other courses during this period. I'm very, very sorry!
I believe there are two main reasons for this observation. First, as @Dai-Wenxun mentioned, the R-Precision metric may not be robust enough. This metric only performs similarity matching within a batch of 32 samples. However, the HumanML3D dataset contains many similar or even identical texts, some detailed and some rough, which makes the composition of each batch a critical factor. Different batch allocations can lead to different results. As the motion generation field has evolved over the past few years, I personally think that this metric no longer precisely assesses the fine-grained matching between motion and text, so this result does not necessarily mean that our generated data is semantically superior to the real data.
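To make the batch dependence concrete, here is a minimal sketch of a batch-wise R-Precision computation. It is not the official evaluator: the function name `r_precision`, the Euclidean ranking, and the synthetic embeddings are assumptions standing in for the outputs of the pre-trained text and motion encoders. Half of the synthetic texts are exact copies of the other half, mimicking identical captions attached to different motions.

```python
import numpy as np

def r_precision(text_emb, motion_emb, batch_size=32, top_k=3, rng=None):
    """Top-1..top_k R-Precision, averaged over random batches (illustrative only)."""
    rng = np.random.default_rng() if rng is None else rng
    n = (len(text_emb) // batch_size) * batch_size  # drop the incomplete last batch
    order = rng.permutation(len(text_emb))[:n]      # batch allocation depends on the seed
    hits = np.zeros(top_k)
    for start in range(0, n, batch_size):
        idx = order[start:start + batch_size]
        t, m = text_emb[idx], motion_emb[idx]
        # Row i holds the distances from text i to every motion in the same batch.
        dist = np.linalg.norm(t[:, None, :] - m[None, :, :], axis=-1)
        ranked = np.argsort(dist, axis=1)[:, :top_k]
        # A hit at rank k means the paired motion is among the k nearest.
        match = ranked == np.arange(batch_size)[:, None]
        hits += np.cumsum(match, axis=1).sum(axis=0)
    return hits / n

# Synthetic stand-ins for evaluator embeddings (assumption, not real data).
data_rng = np.random.default_rng(0)
text = data_rng.normal(size=(640, 16))
text[320:] = text[:320]  # identical captions paired with different motions
motion = text + 0.1 * data_rng.normal(size=text.shape)

for seed in range(3):
    scores = r_precision(text, motion, rng=np.random.default_rng(seed))
    print(f"seed={seed} top-1/2/3 R-Precision: {scores}")
```

Whenever two copies of the same text land in one batch, either paired motion is an equally plausible match, so the score depends on how the samples happen to be shuffled into batches.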
Second, the HumanML3D dataset itself includes noisy data. For instance, some motions do not suit the mirroring augmentation used during data preprocessing, leading to anomalous annotations.
These are my current views.
@h-y1heng Hi yiheng. In my experience, the R-P is evaluated by a pre-trained model. We cannot prevent the bias caused by deep models. BTW, I discussed similar issues and other questions with you via email. Could you please have a look~ (^_^)
@LinghaoChan, thank you for your insights regarding the R-Precision evaluation. However, I haven't received your email regarding this and other questions. Could you please resend it to hyh654@bupt.edu.cn? I'm looking forward to our discussion. Thank you!
@h-y1heng sent~
I compared this with other research on motion generation, and few works report such results on these evaluation metrics. Even retrieval-based methods do not show this behavior. It may be caused by the following reasons:
i. The test set contains part of the training data.
ii. Similar to the previous answer, the seed may have been selected to suit specific samples and so improve the model's measured performance. Have you tested other random seeds? Or does the current seed give the best result?
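Both hypotheses can be checked mechanically. The sketch below is only an illustration: `evaluate_once` is a hypothetical stand-in that returns arbitrary placeholder numbers (not results from any model) and would need to be replaced by a real evaluation run, and the sample IDs passed to `split_overlap` are made up.

```python
import numpy as np

def split_overlap(train_ids, test_ids):
    """Return the sample IDs that appear in both splits (empty if the split is clean)."""
    return sorted(set(train_ids) & set(test_ids))

def summarize_runs(scores):
    """Mean and 95% confidence half-width (normal approximation) over repeated runs."""
    scores = np.asarray(scores, dtype=float)
    half_width = 1.96 * scores.std(ddof=1) / np.sqrt(len(scores))
    return scores.mean(), half_width

def evaluate_once(seed):
    """Hypothetical placeholder for one full evaluation run with a given seed."""
    rng = np.random.default_rng(seed)
    return 0.5 + 0.01 * rng.standard_normal()  # arbitrary number, not a real score

# Hypothesis i: look for leakage between the training and test splits.
print("overlap:", split_overlap(["000001", "000002"], ["000002", "000003"]))

# Hypothesis ii: report the spread over many seeds instead of a single run.
mean, half_width = summarize_runs([evaluate_once(seed) for seed in range(20)])
print(f"score = {mean:.3f} +/- {half_width:.3f}")
```

If the gap between the generated and the real scores is smaller than the reported half-width, the difference is within seed noise.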
Of course, it may also be a clerical error.