pytorch/vision

VideoClips Assertion Error

Closed this issue · 20 comments

Hello,

I'm trying to load a big video. Following #1446 I used a VideoClips object, but it crashes when getting clips with certain ids, with this error:

AssertionError Traceback (most recent call last)
in ()
----> 1 x = video_clips.get_clip(1)

/usr/local/lib/python3.6/dist-packages/torchvision/datasets/video_utils.py in get_clip(self, idx)
324 video = video[resampling_idx]
325 info["video_fps"] = self.frame_rate
--> 326 assert len(video) == self.num_frames, "{} x {}".format(video.shape, self.num_frames)
327 return video, audio, info, video_idx

AssertionError: torch.Size([0, 1, 1, 3]) x 32

The code I use is just this:

from torchvision.datasets.video_utils import VideoClips
video_clips = VideoClips(["test_video.mp4"], clip_length_in_frames=32, frames_between_clips=32)
for i in range(video_clips.num_clips()):
    x = video_clips.get_clip(i)

video_clips.num_clips() is much bigger than the ids that fail. Changing clip_length_in_frames or frames_between_clips doesn't help.

Checking the code I see [0,1,1,3] is returned by read_video when no vframes are read:

if vframes:
    vframes = torch.as_tensor(np.stack(vframes))
else:
    vframes = torch.empty((0, 1, 1, 3), dtype=torch.uint8)

But for some clip ids and clip lengths the result isn't empty, it's just that the sizes don't match; the assertion error is then something like this: AssertionError: torch.Size([19, 360, 640, 3]) x 128

I traced the issue to _read_from_stream and checked that no AV exceptions were raised. Running this part of the function:

for idx, frame in enumerate(container.decode(**stream_name)):
    frames[frame.pts] = frame
    if frame.pts >= end_offset:
        if should_buffer and buffer_count < max_buffer_size:
            buffer_count += 1
            continue
        break

I saw that for start_pts=32032, end_pts=63063 it returned just one frame in frames, with pts=237237, which is later discarded because it's much bigger than end_pts.

Also, the stream.time_base is Fraction(1, 24000) which doesn't match the start and end pts provided by VideoClips.
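For reference, PTS values are expressed in units of stream.time_base, so they can be converted to seconds; a minimal sketch using the values reported above (assuming time_base is the Fraction(1, 24000) mentioned):

```python
from fractions import Fraction

# stream.time_base as reported above (assumption: it applies to these pts values)
time_base = Fraction(1, 24000)

def pts_to_seconds(pts, time_base):
    """Convert a presentation timestamp to seconds using the stream time base."""
    return float(pts * time_base)

print(pts_to_seconds(32032, time_base))   # ~1.335 s (requested window start)
print(pts_to_seconds(63063, time_base))   # ~2.628 s (requested window end)
print(pts_to_seconds(237237, time_base))  # ~9.885 s, far past the requested window
```

With this time base, the single returned frame sits almost 7 seconds past the requested window, consistent with a seek landing in the wrong place.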

So it seems there is a problem with the seeking on my video. But it has a standard h264 encoding and I have no problem reading it sequentially with pyav.
I'm wondering if I'm doing something wrong or there might be an issue with the read_video seeking (as the warning says it should be using seconds?).

This is the video info according to ffmpeg:

Metadata:
major_brand : mp42
minor_version : 0
compatible_brands: mp42isom
creation_time : 2016-10-10T15:36:46.000000Z
Duration: 00:21:24.37, start: 0.000000, bitrate: 1002 kb/s
Stream #0:0(und): Video: h264 (Main) (avc1 / 0x31637661), yuv420p, 640x360 [SAR 1:1 DAR 16:9], 900 kb/s, 23.98 fps, 23.98 tbr, 24k tbn, 47.95 tbc (default)
Metadata:
handler_name : Telestream Inc. Telestream Media Framework - Release TXGP 2016.42.192059
encoder : AVC
Stream #0:1(eng): Audio: aac (LC) (mp4a / 0x6134706D), 48000 Hz, stereo, fltp, 93 kb/s (default)
Metadata:
handler_name : Telestream Inc. Telestream Media Framework - Release TXGP 2016.42.192059

Thanks!

Hello and thank you for the thorough analysis.

This issue looks like a corrupted file, but as you say, the FFmpeg info looks OK.
Have you tried using a different backend ('video_reader' vs 'pyav')? That saved my ass in one case at least.

Best,
Bruno

I'm having a similar problem:

Traceback (most recent call last):
  File "/Users/fernando/git/sudep/scripts/infer_video_kinetics.py", line 99, in <module>
    sample = dataset[i]
  File "/Users/fernando/git/sudep/scripts/infer_video_kinetics.py", line 74, in __getitem__
    video, audio, info, video_idx = self.video_clips.get_clip(idx)
  File "/usr/local/Caskroom/miniconda/base/envs/sudep/lib/python3.6/site-packages/torchvision/datasets/video_utils.py", line 367, in get_clip
    video.shape, self.num_frames
AssertionError: torch.Size([6, 128, 228, 3]) x 8

I'm iterating over a Dataset built using VideoClips. The error happens while retrieving sample number 156 out of 174, so it's not the end of the video. For now, I just commented out the assertion, but this way I can't use a DataLoader because the samples will have different sizes.
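As a stopgap for the DataLoader problem, a collate function can trim every clip in a batch to the shortest length before stacking. A hedged sketch, using lists as stand-ins for the video tensors and assuming the (video, audio, info, video_idx) tuple layout returned by get_clip:

```python
def trim_collate(batch):
    """Trim all clips in a batch to the shortest clip length so they can be stacked."""
    min_len = min(len(video) for video, _, _, _ in batch)
    return [(video[:min_len], audio, info, idx) for video, audio, info, idx in batch]

# Stand-in batch: one full 8-frame clip and one truncated 6-frame clip.
batch = [(list(range(8)), None, {}, 0), (list(range(6)), None, {}, 1)]
print([len(video) for video, _, _, _ in trim_collate(batch)])  # [6, 6]
```

Passed as collate_fn to a DataLoader, this keeps batching working, at the cost of silently shortening some clips.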

I haven't been able to try with video_reader:

/usr/local/Caskroom/miniconda/base/envs/sudep/lib/python3.6/site-packages/torchvision/__init__.py:64: UserWarning: video_reader video backend is not available
  warnings.warn("video_reader video backend is not available")

@fmassa this seems similar to a problem I had on my dev machine, which I attributed to the overall messiness of my conda installation: namely, I've had several issues where a standard install would not build video_reader and I'd have to
a) manually install the dependencies, and
b) build TV from source

note: it often took a few iterations of a) and b) before everything was working properly

@fepegar can you confirm that this is what's happening?
If so, I'll try to get a clean repro for this and see if I can tackle the build system for it.

Thanks and best wishes,
Bruno

@fepegar can you confirm that this is what's happening?

I'm not sure exactly what you'd like me to confirm 😅

I'm on macOS, ran this:

$ conda create -n tv python -y && conda activate tv && pip install torch torchvision
$ python -c "import torchvision; torchvision.set_video_backend('video_reader')"

And got the above message. I'll investigate further. But I feel like this discussion should maybe move to a new issue.

My value of ext_specs is None here, in case it helps.

ext_specs = extfinder.find_spec("video_reader")
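That check boils down to probing for a compiled extension module with importlib. A minimal reproduction of the probe (the FileFinder setup mirrors the pattern in torchvision/__init__.py, but here it searches the current directory, where no extension exists):

```python
import importlib.machinery

# Build a finder that only recognizes compiled extension modules (.so/.pyd)
loader_details = (
    importlib.machinery.ExtensionFileLoader,
    importlib.machinery.EXTENSION_SUFFIXES,
)
extfinder = importlib.machinery.FileFinder(".", loader_details)

ext_specs = extfinder.find_spec("video_reader")
print(ext_specs)  # None when no compiled video_reader extension is found
```

So ext_specs being None simply means no compiled video_reader shared library was found next to the torchvision package, which is why building from source is the usual fix.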

I just tried building from source, but I'm still not able to set the video_reader backend.

@fmassa do you have an idea about the issue?

Hello,

We dug a bit deeper into this and found that setting should_buffer to True fixes the issue:

should_buffer = False

The problem is in this section that reads the frames:

for idx, frame in enumerate(container.decode(**stream_name)):
    frames[frame.pts] = frame
    if frame.pts >= end_offset:
        if should_buffer and buffer_count < max_buffer_size:
            buffer_count += 1
            continue
        break

PTS might not be read in order and this causes the break to happen before all the relevant frames have been read.

For example, in our case end_offset is 15, but first a frame with PTS 15 is received and then one with PTS 14. So we hit the break without reading frame 14, and we crash later on the size assertion.

It seems this can happen with AVI videos; I found this PyAV discussion relevant: PyAV-Org/PyAV#534. We confirmed we are in a similar case: our AVI video has frames without PTS, since PTS is not strictly required.

Setting should_buffer to True seems like a good solution; is there any reason why it is set to False and not exposed as a parameter? Another option could be a hard comparison frame.pts == end_offset. I'm not fully sure a frame with exactly that PTS always arrives, but if end_offset is chosen as in VideoClips (selecting keyframes) it should work too.
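The effect of the early break can be reproduced with a pure-Python stand-in for the decode loop (the PTS sequence below is illustrative, not from a real container):

```python
def collect_frames(pts_stream, end_offset, should_buffer, max_buffer_size=4):
    """Mimic the _read_from_stream loop: store frames by pts, stop past end_offset."""
    frames, buffer_count = {}, 0
    for pts in pts_stream:
        frames[pts] = "frame@%d" % pts
        if pts >= end_offset:
            if should_buffer and buffer_count < max_buffer_size:
                buffer_count += 1
                continue
            break
    return sorted(frames)

decoded_order = [0, 2, 1, 4, 3, 5]  # B-frames: decode order != presentation order
print(collect_frames(decoded_order, 3, should_buffer=False))  # [0, 1, 2, 4] -> pts 3 lost
print(collect_frames(decoded_order, 3, should_buffer=True))   # [0, 1, 2, 3, 4, 5]
```

Without buffering, the first frame at or past end_offset triggers the break even though an earlier frame is still queued behind it; with buffering, a few extra iterations recover it.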

Hi @mjunyent

Thanks for the investigation!

We could make should_buffer True by default. This would have a small impact on runtime speed, but it might be worth it to avoid those corner-case issues.

The issue I found with empty pts was due to packed b-frames in DivX, but that was the only case I found for this type of video. I agree that the handling for this is very fragile though.

If you could run some performance benchmarks comparing the runtime with should_buffer always True against the current default, and the results are not much slower, could you send a PR setting should_buffer to True?
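A hedged sketch of such a benchmark using only the stdlib timeit module; read_buffered and read_unbuffered are hypothetical wrappers around read_video with should_buffer patched to True/False (cheap stand-ins are used here so the harness itself runs without torchvision):

```python
import timeit

def bench(fn, repeat=5, number=10):
    """Best-of-`repeat` wall-clock time for `number` calls of fn."""
    return min(timeit.repeat(fn, number=number, repeat=repeat))

# Stand-in workloads; replace with real read_video calls on a sample clip.
read_unbuffered = lambda: sum(range(1000))
read_buffered = lambda: sum(range(1000))

t_off = bench(read_unbuffered)
t_on = bench(read_buffered)
print("buffered overhead: %+.1f%%" % (100.0 * (t_on - t_off) / t_off))
```

Taking the best of several repeats reduces noise from other processes, which matters when the expected difference is small.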

Thanks!

Shall I create an issue about video_reader not being available, or do you think it's been fixed in #2183?

@fepegar please open a new issue. I hope it was fixed with #2183, so if you could try that first it would be great.

I'm still having this issue (on Linux). I'm using version 0.6.0 and I set should_buffer = True.

My video:

$ ffprobe 006_01_L.mp4                                                                        
Input #0, mov,mp4,m4a,3gp,3g2,mj2, from '006_01_L.mp4':
  Metadata:
    major_brand     : isom
    minor_version   : 512
    compatible_brands: isomiso2mp41
    encoder         : Lavf57.83.100
  Duration: 00:01:52.13, start: 0.000000, bitrate: 691 kb/s
    Stream #0:0(und): Video: hevc (Rext) (hev1 / 0x31766568), yuv444p(tv, progressive), 640x360, 558 kb/s, 15 fps, 15 tbr, 15360 tbn, 15 tbc (default)
    Metadata:
      handler_name    : VideoHandler
    Stream #0:1(und): Audio: aac (LC) (mp4a / 0x6134706D), 44100 Hz, stereo, fltp, 128 kb/s (default)
    Metadata:
      handler_name    : SoundHandler

Should I open a new issue for this?

@fepegar Hi, did you solve this problem?

@mjunyent Hi, did you solve this problem?

I still run into this issue

This is still an issue that was recently re-introduced in #3791

This is the same problem as #4839 and #4112

Raising the priority to high because it's been broken for several months already

Still seeing this error.

Still seeing this error.

@jramapuram Could you please confirm if #5489 fixes your error?

I am running into a similar issue, where the VideoClips instance returns exactly one more frame than expected (tested with several values).

I am using PyAV as a backend on torch=1.12.1 and torchvision=0.12.0. The dataset is Kinetics, downloaded from the S3 bucket referenced in the Kinetics dataset class.

I have no idea how to solve this, or whether it's even a problem. I could just drop the last frame, but that doesn't seem like what I should do.
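If dropping the extra frame turns out to be acceptable, a tiny helper can cap the clip at the expected length after get_clip; enforce_num_frames is a hypothetical name, not part of torchvision, and slicing works the same on a torch tensor's first dimension as on the list stand-in below:

```python
def enforce_num_frames(video, num_frames):
    """Drop any extra trailing frames so the clip has at most num_frames."""
    return video[:num_frames]

# Stand-in clip with one extra frame, as described above.
clip = list(range(33))                    # e.g. 33 frames back from a 32-frame request
print(len(enforce_num_frames(clip, 32)))  # 32
```

This only papers over the symptom; clips that come back short (like the torch.Size([6, ...]) x 8 case earlier in the thread) would still need padding or skipping.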