VSR performance lower on MuAViC version of LRS3 (En)
roudimit opened this issue · 2 comments
Hi, thanks for your nice work! I preprocessed the MuAViC dataset according to the instructions. I already had LRS3 processed according to the AV-HuBERT instructions, so I wanted to test if a pre-trained model would get the same performance on both the AV-HuBERT dataset version and the MuAViC version of LRS3.
I first tried ckpt=large_noise_pt_noise_ft_433h.pt from AV-HuBERT and ran this command:
python -B infer_s2s.py --config-dir ./conf/ --config-name s2s_decode.yaml \
dataset.gen_subset=test common_eval.path=${ckpts_dir}/${ckpt} \
common_eval.results_path=${exp_dir}/av-hubert/decode/s2s/test \
override.modalities=['audio','video'] override.data=${lrs3_dir}/30h_data override.label_dir=${lrs3_dir}/30h_data common.user_dir=`pwd`
Using the AV-HuBERT version of LRS3:
- 433h audio-visual: 1.486
- 433h audio-only: 1.951
- 433h video-only: 34.135
Using the MuAViC version of LRS3:
- 433h audio-visual: 1.496 (slightly worse)
- 433h audio-only: 1.951 (the same)
- 433h video-only: 35.995 (noticeably worse)
It seems that the AV-HuBERT checkpoint performs worse on the MuAViC version of the data whenever video is involved.
I also tried running the MuAViC decoding script using the MuAViC English checkpoint on the MuAViC version of LRS3 and got the following performance:
- 433h audio-visual: 2.1941
- 433h audio-only: 3.22
- 433h video-only: 35.995
Then I tried the MuAViC decoding script, MuAViC English checkpoint, and the AV-HuBERT LRS3 dataset version:
- 433h audio-visual: 2.153 (slightly better)
- 433h audio-only: 3.225 (the same)
- 433h video-only: 34.459 (noticeably better).
The MuAViC checkpoint also gets better performance on the AV-HuBERT version of LRS3, which is kind of surprising. In both cases (AV-HuBERT checkpoint or MuAViC checkpoint), the audio-only performance stays essentially identical.
I have also tried this with the other AV-HuBERT checkpoints and the conclusion is the same (also, the gap was more noticeable for the base models).
I wonder if MuAViC processed the LRS3 video differently than AV-HuBERT, which leads to a different performance?
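To put numbers on the gaps reported above, here is a minimal sketch that computes the absolute and relative difference between the two dataset versions for each checkpoint and modality. The scores are copied from this post; the `SCORES` table and the `gap` helper are illustrative names, and the assumption that these are error rates in percent (lower is better) is mine.

```python
# Scores copied from the post above, as (AV-HuBERT data version, MuAViC data version).
# Assumption: these are error rates in percent, so lower is better.
SCORES = {
    "AV-HuBERT checkpoint": {
        "audio-visual": (1.486, 1.496),
        "audio-only": (1.951, 1.951),
        "video-only": (34.135, 35.995),
    },
    "MuAViC checkpoint": {
        "audio-visual": (2.153, 2.1941),
        "audio-only": (3.225, 3.22),
        "video-only": (34.459, 35.995),
    },
}


def gap(avhubert_score, muavic_score):
    """Return (absolute, relative %) change going from AV-HuBERT data to MuAViC data."""
    absolute = muavic_score - avhubert_score
    relative = 100.0 * absolute / avhubert_score
    return absolute, relative


for checkpoint, modalities in SCORES.items():
    for modality, (avhubert_score, muavic_score) in modalities.items():
        absolute, relative = gap(avhubert_score, muavic_score)
        print(f"{checkpoint:22s} {modality:13s} {absolute:+.3f} abs, {relative:+.2f}% rel")
```

A positive value means the MuAViC version of the data scores worse. For both checkpoints the video-only gap is by far the largest in absolute terms, while the audio-only gap is zero or close to it.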
Hi @roudimit ,
Thank you so much for raising this issue and so sorry for the late reply!
To be honest, I never tested our checkpoints on VSR since it was out of scope! However, looking at the video processing code for MuAViC and AV-HuBERT, I can see there are a few differences:
- How frames are extracted from the video: AV-HuBERT does this on the fly, while MuAViC does it beforehand.
- How the video is saved: both use ffmpeg, but a bit differently.
These are the only differences that I could find! Hope this helps.
Thanks @Anwarvic for the pointers! I tested the video loading and the video saving. The loading functions from MuAViC and AV-HuBERT load the video the same way. However, the saving with ffmpeg is different: AV-HuBERT specifies '-crf', '20', while MuAViC uses the default (I believe crf=23), which means the video frames from MuAViC are more compressed. A link for more details: https://stackoverflow.com/questions/64011346/ffmpeg-quality-conversion-options-video-compression
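For anyone who wants to reproduce the difference, here is a rough sketch of the two ffmpeg invocations being compared. It only builds the argument lists and does not run ffmpeg. The helper `build_encode_cmd`, the file names, and the choice of `libx264` as encoder are my assumptions for illustration; this is not the actual saving code from either repository. Only the `-crf 20` value comes from the AV-HuBERT code mentioned above.

```python
def build_encode_cmd(src, dst, crf=None):
    """Build an ffmpeg argument list for re-encoding a clip (nothing is executed).

    Assumption: the encoder is libx264, whose default CRF is 23 when -crf is omitted.
    """
    cmd = ["ffmpeg", "-y", "-i", src, "-c:v", "libx264"]
    if crf is not None:
        # Lower CRF means higher quality and larger files.
        cmd += ["-crf", str(crf)]
    cmd.append(dst)
    return cmd


# Hypothetical file names, for illustration only.
explicit_crf = build_encode_cmd("clip.mp4", "clip_crf20.mp4", crf=20)
encoder_default = build_encode_cmd("clip.mp4", "clip_default.mp4")

print(" ".join(explicit_crf))
print(" ".join(encoder_default))
```

Re-saving the MuAViC clips with an explicit CRF of 20 and re-running the decoding would be one way to check whether compression alone explains the video-only gap.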
I'm going to leave this issue open so that others are aware of the difference between the video processing.