md-mohaiminul/ViS4mer

lvu_durations.csv

Opened this issue · 8 comments

Hi authors,

How are the durations in lvu_durations.csv computed? The last ~20 s of most videos show previews for other videos. Does lvu_durations.csv give the number of seconds in each video excluding the preview duration?

Thanks

These lines of code

    for i in range(int(duration)):
        idx = int(video.shape[0] / duration * i)
        x = torch.unsqueeze(video[idx], 0).to(device)
        x = model.forward_features(x)

suggest that these previews are used in training and evaluation. Could you confirm? Thanks!
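As a quick standalone check of what that loop does (the frame count and duration below are hypothetical, and assume the video was decoded without trimming):

```python
# The loop samples `duration` frames spread uniformly over ALL decoded frames,
# so if the decoded video is longer than `duration`, the tail is still reached.
num_frames = 1000  # hypothetical: full decoded length, outro included
duration = 800     # hypothetical: trimmed duration from lvu_durations.csv

indices = [int(num_frames / duration * i) for i in range(int(duration))]
print(indices[0], indices[-1])  # 0 998 -- the last index lies in the untrimmed tail
```

Because `video.shape[0] / duration > 1` whenever the decoded video is longer than the trimmed duration, the sampled indices stretch all the way into the tail of the video.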

Hi,
Thanks for reaching out. We used the durations from the Condensed Movies dataset. They removed the outro/preview from each video, as described in Section 3.1 of their paper. Therefore, lvu_durations.csv does not include the outro/preview of each video.

Thanks for your reply! Do the downloaded mp4 videos have outro/preview removed?

If not, the following code includes the outro/preview, and those frames are then used in training/evals.

    video = get_video(video_fp)
    video = torch.from_numpy(video.transpose([0, 3, 1, 2])).float()
    duration = duration_data.loc[video_id]['duration']
    print(cnt, video_id, video.shape, duration)
    features = np.zeros((duration+1, 197, 1024))
    for i in range(int(duration)):
        idx = int(video.shape[0] / duration * i)
        x = torch.unsqueeze(video[idx], 0).to(device)
        x = model.forward_features(x)

For example, consider the video 9NG5mJgw6Yg in the writer set, with duration = 154 s and an actual video length of 184 s. The code above will include frames after 154 s, which contain the outro/preview.

In the above example, walking through the code at i=153 (assuming the video decodes to 184 frames):
idx = int(184 / 154 * 153) = 182
Hence, features[153] = model.forward_features(video[182])

In effect, features[153] contains outro frame 182. So during LVU evals, frame 182 will be used for this video, which is not what you intended. This looks like a bug. The same is true for many videos and frames.
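For this example video, one can also count how many of the sampled indices land in the outro region (a quick sanity check, using the same hypothetical 1 fps decode rate so that frame index ≈ second):

```python
num_frames = 184   # full decoded video, including the outro (hypothetical 1 fps)
duration = 154     # trimmed duration from lvu_durations.csv

# Sampled loop indices whose chosen frame falls past the trimmed region.
outro_hits = [i for i in range(int(duration))
              if int(num_frames / duration * i) >= duration]
print(len(outro_hits), outro_hits[0])  # 25 129
```

Under these assumptions, 25 of the 154 sampled feature slots (every i from 129 onward) are filled with outro frames.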

Hi,
I think you are right. You need to remove the outro first, which is what we did. You can use the durations from 'lvu_durations.csv' to do this.
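A minimal sketch of that trimming step (the function name, array layout, and fps value are hypothetical; the real loader must supply the decoded frame rate):

```python
import numpy as np

def trim_outro(video: np.ndarray, duration: float, fps: float) -> np.ndarray:
    """Drop frames after `duration` seconds (the outro/preview region).

    `video` is a (T, H, W, C) array as returned by the video loader;
    `duration` is the trimmed length from lvu_durations.csv; `fps` is the
    decoded frame rate (hypothetical here -- it must come from the decoder).
    """
    keep = min(video.shape[0], int(round(duration * fps)))
    return video[:keep]

# Hypothetical example: 184 frames decoded at 1 fps, trimmed duration 154 s.
video = np.zeros((184, 8, 8, 3), dtype=np.float32)
trimmed = trim_outro(video, duration=154, fps=1.0)
print(trimmed.shape[0])  # 154
```

Applying this before the sampling loop keeps all sampled indices inside the trimmed region, since `video.shape[0]` then equals the trimmed frame count.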

Thanks! Could you please check and confirm whether the results reported in the paper include the outro, in light of the above bug?
The current state of the codebase is definitely using the outro.

Context:
I'm struggling to reproduce the results from the paper. There is a ~1% difference in performance depending on whether I include or exclude the outro, and including the outro brings my results closer to those reported in the paper.

Which tasks did you try, and what performance are you getting? Also, how did you solve the 'NaN' issue? Could you please reply on the other issue so that others can benefit from it?

I've not been able to solve the NaN issue. I'm working on a reimplementation in JAX, building upon annotated-s4.

I've tried all the classification tasks. There is a ~1% gap on relationship, director, writer, and speaking when including/excluding the outro.