the performance is very low on my own dataset.
onlyonewater opened this issue · 5 comments
Hi @RenShuhuai-Andy, I think TimeChat is great work, but when I test it on my own dataset the performance is very low: the mIoU is only 0.076. The question prompt is:

"You are given a video from a custom dataset. Please find the visual event described by a sentence in the video, determining its starting and ending times. The format should be: 'The event happens at the start time-end time'. For example, The event 'person turn a light on' happens in the 24.3 - 30.4 seconds. Now I will give you the textual sentence: {}. Please return its start time and end time.".format(sentence)

I set num-beams to 1 and the temperature to 1, and I do not use any quantization, so could you give me some advice on how to improve the performance?
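For context, here is roughly how I build the prompt and compute mIoU over my dataset (a minimal sketch; parse_span and temporal_iou are my own helpers rather than TimeChat code, and mIoU here is simply the average temporal IoU over all queries):

```python
import re

# Fill the grounding template per query sentence, parse the predicted span from
# the model's reply, and score it against the ground-truth span.

PROMPT_TEMPLATE = (
    "You are given a video from a custom dataset. Please find the visual event "
    "described by a sentence in the video, determining its starting and ending times. "
    "The format should be: 'The event happens at the start time-end time'. "
    "For example, The event 'person turn a light on' happens in the 24.3 - 30.4 seconds. "
    "Now I will give you the textual sentence: {}. "
    "Please return its start time and end time."
)

def build_prompt(sentence: str) -> str:
    return PROMPT_TEMPLATE.format(sentence)

def parse_span(answer: str):
    """Pull 'start - end' seconds out of a reply such as
    'The event happens in the 24.3 - 30.4 seconds'; returns None if nothing matches."""
    m = re.search(r"(\d+(?:\.\d+)?)\s*-\s*(\d+(?:\.\d+)?)", answer)
    return (float(m.group(1)), float(m.group(2))) if m else None

def temporal_iou(pred, gt):
    """IoU of two (start, end) spans in seconds; 0.0 for disjoint spans."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0
```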
I think the number of input frames may affect the performance, since most Video LLMs keep only 8 or 96 frames as visual input. This is fine when the video is short, i.e., less than 30 or 60 seconds, but if the video is longer, say more than 60 seconds, the number of frames should be increased. Do the authors have any comments on this? @RenShuhuai-Andy
Hi, thanks for your interest. Could you provide more information to help analyze the low performance, e.g., the duration, domain, and complexity of your videos?
As for the number of frames, yes, I believe it should be increased if the video is long. Note that Gemini 1.5 samples at 1 fps, which means more than 96 frames for any video longer than 96 seconds.
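To make that concrete (back-of-the-envelope arithmetic only, not TimeChat code):

```python
import math

def frames_at_fixed_fps(duration_s: float, fps: float = 1.0) -> int:
    """Frames obtained when sampling a video at a fixed rate (Gemini-1.5-style 1 fps)."""
    return math.ceil(duration_s * fps)

# A 100-second video at 1 fps already exceeds TimeChat's default 96-frame budget.
print(frames_at_fixed_fps(100.0))  # -> 100
```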
If you want to use more frames in TimeChat, you can change max_frame_pos, n_frms, and num_frm from 96 to any larger number, then run inference directly. However, this will use more GPU memory, and I'm not sure about the performance in that setting. We plan to explore a stronger TimeChat with long-context capability in the future.
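A rough sketch of what this change could look like (illustrative only; the exact config objects and attribute paths may differ, so please locate these three keys in your own checkout before running inference):

```python
TARGET_FRAMES = 128  # any value larger than the default 96; expect higher GPU memory use

def bump_frame_settings(cfg, target: int = TARGET_FRAMES):
    """Raise every frame-count setting to the same target before building the model.

    `cfg` is assumed to be an attribute-style nested config (e.g. OmegaConf);
    the paths below are placeholders -- adjust them to wherever max_frame_pos,
    n_frms, and num_frm actually live in your config files.
    """
    cfg.model.max_frame_pos = target  # frame position-embedding size (assumed meaning)
    cfg.datasets.n_frms = target      # frames sampled per video (assumed meaning)
    cfg.run.num_frm = target          # frames used at inference (assumed meaning)
    return cfg
```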
Thanks for your response. The videos in my dataset are about 120 seconds long on average, and they are not real-world footage, so maybe there is a domain gap between the training data and my test data? You mentioned the parameters max_frame_pos, n_frms, and num_frm; could you give some explanation of them? They seem to have the same meaning, i.e., the number of input frames.
OK, got it, thanks!