ControlNet/AV-Deepfake1M

Metadata number of audio frames does not match real number of audio frames

MKlmt opened this issue · 2 comments

First of all, thanks for the great work and exciting competition.
When loading the data, I noticed a slight mismatch between the number of audio frames given in the metadata and the number returned by torchvision.io.read_video(). The mismatch occurs only when the audio is fake; for real audio, the two counts match.
The minimal example below prints the following for me:
pytorch: torch.Size([1, 112640]) metadata: 111680

Code:

import json

from torchvision.io import read_video

video_path = "<path to dataset>/DeepFake_1M/train/id06744/_c3CCbnZEbU/00011/real_video_fake_audio.mp4"
video_metadata_path = "<path to dataset>/DeepFake_1M/train_metadata/id06744/_c3CCbnZEbU/00011/real_video_fake_audio.json"

# read_video returns (video frames, audio frames, info dict)
frames, audio, info = read_video(video_path, pts_unit="sec", output_format="TCHW")
print("pytorch:", audio.size())  # audio has shape (channels, samples)

with open(video_metadata_path, "r") as f:
    metadata = json.load(f)
print("metadata:", metadata["audio_frames"])
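To put a number on the gap, the two values from the output reported above can be compared directly. The sketch below also checks one possible explanation that the thread does not confirm: that the decoded length is the metadata length rounded up to a whole number of codec frames. The block size of 1024 samples is an assumption (it is the AAC-LC frame size, but the codec of this file is not stated here).

```python
# Values copied from the output reported above.
decoded_samples = 112640   # audio.size(1) from read_video
metadata_samples = 111680  # metadata["audio_frames"]

# ASSUMPTION: 1024 samples per codec frame (AAC-LC); not confirmed in this thread.
BLOCK = 1024

# Absolute gap between the two counts.
diff = decoded_samples - metadata_samples

# Round the metadata count up to the next multiple of BLOCK.
padded = -(-metadata_samples // BLOCK) * BLOCK

print("difference in samples:", diff)
print("decoded is a multiple of BLOCK:", decoded_samples % BLOCK == 0)
print("metadata rounded up to BLOCK:", padded)
```

For this one file the rounded-up metadata count equals the decoded count, which is consistent with encoder padding, but a single sample is not enough to establish the cause.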

Versions:
torch 2.3.0
torchvision 0.18.0

I hope you can help me understand this mismatch. Thank you.

The metadata is provided as a quick reference so that you can look up values without loading the video file. Please base your development on the number of frames in the actual decoded audio.
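One way to follow this advice in a data loader is to always use the decoded length and to treat the metadata value only as a sanity check. The helper below is a minimal sketch; the function name `resolve_audio_length` and the `tolerance` default are hypothetical and not part of the dataset tooling.

```python
def resolve_audio_length(decoded_len: int, metadata_len: int, tolerance: int = 1024) -> int:
    """Return the audio length to use downstream: always the decoded one.

    The metadata value is only used to detect files whose gap is
    suspiciously large. `tolerance` (in samples) is a hypothetical default.
    """
    gap = abs(decoded_len - metadata_len)
    if gap > tolerance:
        raise ValueError(
            f"decoded length {decoded_len} differs from metadata "
            f"{metadata_len} by {gap} samples (tolerance {tolerance})"
        )
    return decoded_len


# Values from the report above: the gap is within tolerance,
# so the decoded length is returned.
print(resolve_audio_length(112640, 111680))
```

In the snippet from the report, this would be called as `resolve_audio_length(audio.size(1), metadata["audio_frames"])`.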

Alright. Thank you