Train, valid, and test datasets may have quite different characteristics
I did a meta-analysis of each dataset and found some interesting results.
- The test dataset has very short video lengths.
- On average, about 2 of the 4 candidate actions/answers have already been taken in the video.
- About 90% of the correct actions have not yet been taken in the video.
- If we could remove all already-taken actions (about 2 on average) from the candidates, random-choice accuracy could be around 50% (assuming the narration text correctly describes the actions in the video).
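The ~50% estimate above can be sketched with a toy simulation. This is a hypothetical illustration only: the `candidates`/`taken`/`is_answer` fields are made-up stand-ins for candidate options flagged as already taken (e.g. via narrations), not the dataset's actual schema.

```python
import random

random.seed(0)

def random_choice_accuracy(questions):
    """Accuracy of guessing uniformly among not-yet-taken candidates."""
    correct = 0
    for q in questions:
        # Drop candidates already taken in the video
        # ("taken" flags are hypothetical, derived from narrations).
        remaining = [c for c in q["candidates"] if not c["taken"]]
        guess = random.choice(remaining)
        correct += guess["is_answer"]
    return correct / len(questions)

# Toy case: 4 candidates, 2 already taken, correct answer not yet taken.
toy = [{
    "candidates": [
        {"taken": True,  "is_answer": False},
        {"taken": True,  "is_answer": False},
        {"taken": False, "is_answer": True},
        {"taken": False, "is_answer": False},
    ],
}] * 1000

print(random_choice_accuracy(toy))  # roughly 0.5 for many questions
```

With 2 of 4 candidates filtered out, a random guess picks between 2 remaining options, hence the ~50% figure.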
Please note that I may have made mistakes.
We acknowledge that our candidate options do include actions that have already occurred in the video. However, the action narrations provided in `task_progress_metadata` are intended to serve as a reference only. During model inference, using information from the ground-truth action narrations is not allowed; the model must rely solely on visual observations to infer task progress. Therefore, your approach of using the ground-truth `task_progress_metadata` to eliminate options is not appropriate.