Metadata number audio frames does not match real number of audio frames

Question

MKlmt opened this issue 2 months ago · 2 comments

First of all, thanks for the great work and exciting competition.
When loading the data, I noticed a slight mismatch between the number of audio frames given in the metadata and the number returned by torchvision.io.read_video(). The mismatch only occurs when the audio is fake; for real audio, the two numbers match.
The minimal example below prints the following for me:
pytorch: torch.Size([1, 112640])
metadata: 111680

Code:

import json

from torchvision.io import read_video

video_path = "<path to dataset>/DeepFake_1M/train/id06744/_c3CCbnZEbU/00011/real_video_fake_audio.mp4"
video_metadata_path = "<path to dataset>/DeepFake_1M/train_metadata/id06744/_c3CCbnZEbU/00011/real_video_fake_audio.json"

# read_video returns (video frames, audio samples, info dict)
frames, audio, info = read_video(video_path, pts_unit="sec", output_format="TCHW")
print("pytorch:", audio.size())

# Audio frame count as recorded in the dataset's metadata file
with open(video_metadata_path, "r") as f:
    metadata = json.load(f)
print("metadata:", metadata["audio_frames"])

Versions:
torch 2.3.0
torchvision 0.18.0

I hope you can help me understand this mismatch. Thank you.

Answer 1 · 2024-05-18T14:16:26.000Z

The metadata is generated as a simple reference, so that the numbers can be looked up without loading the video file. Please base your development on the number of frames in the actual audio loaded from the file.
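
A minimal sketch of that advice, using the numbers from the report above. The helper name `reconcile_audio_frames` and the keys of the returned dict are hypothetical and not part of the dataset tooling; `decoded_frames` stands for `audio.size(1)` as returned by `read_video()`, and `metadata` for the parsed JSON file.

```python
def reconcile_audio_frames(decoded_frames: int, metadata: dict) -> dict:
    """Prefer the decoded audio length over the value stored in the metadata.

    Hypothetical helper: `decoded_frames` is assumed to come from
    audio.size(1) after read_video(); `metadata` is the parsed JSON
    containing an "audio_frames" field.
    """
    reported = metadata["audio_frames"]
    return {
        "audio_frames": decoded_frames,            # value to build on
        "metadata_audio_frames": reported,         # kept for reference only
        "difference": decoded_frames - reported,   # > 0: file holds more frames
    }


# Numbers taken from the report in this thread
result = reconcile_audio_frames(112640, {"audio_frames": 111680})
print(result["difference"])  # 960
```

Keeping the metadata value alongside the decoded one makes it easy to log how far the two drift apart across the dataset, while downstream code only reads `audio_frames`.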

Answer 2 · 2024-05-21T06:17:59.000Z

Alright, thank you.