facebookresearch/muavic

VSR performance lower on MuAViC version of LRS3 (En)

roudimit opened this issue · 2 comments

Hi, thanks for your nice work! I preprocessed the MuAViC dataset according to the instructions. I already had LRS3 processed according to the AV-HuBERT instructions, so I wanted to test if a pre-trained model would get the same performance on both the AV-HuBERT dataset version and the MuAViC version of LRS3.

I first tried ckpt=large_noise_pt_noise_ft_433h.pt from AV-HuBERT, and ran this command:

python -B infer_s2s.py --config-dir ./conf/ --config-name s2s_decode.yaml \
  dataset.gen_subset=test common_eval.path=${ckpts_dir}/${ckpt} \
  common_eval.results_path=${exp_dir}/av-hubert/decode/s2s/test \
  override.modalities=['audio', 'video'] override.data=${lrs3_dir}/30h_data override.label_dir=${lrs3_dir}/30h_data common.user_dir=`pwd`

Using the AV-HuBERT version of LRS3:

  • 433h audio-visual: 1.486
  • 433h audio-only: 1.951
  • 433h video-only: 34.135

Using the MuAViC version of LRS3:

  • 433h audio-visual: 1.496 (slightly worse)
  • 433h audio-only: 1.951 (the same)
  • 433h video-only: 35.995 (noticeably worse)

It seems that the AV-HuBERT checkpoint performs worse on the MuAViC version of the data whenever video is involved.

I also tried running the MuAViC decoding script using the MuAViC English checkpoint on the MuAViC version of LRS3 and got the following performance:

  • 433h audio-visual: 2.1941
  • 433h audio-only: 3.22
  • 433h video-only: 35.995

Then I tried the MuAViC decoding script, MuAViC English checkpoint, and the AV-HuBERT LRS3 dataset version:

  • 433h audio-visual: 2.153 (slightly better)
  • 433h audio-only: 3.225 (the same)
  • 433h video-only: 34.459 (noticeably better)

The MuAViC checkpoint also performs better on the AV-HuBERT version of LRS3, which is somewhat surprising. In both cases (AV-HuBERT checkpoint or MuAViC checkpoint), the audio-only performance stays essentially the same.
I have also tried this with the other AV-HuBERT checkpoints and the conclusion is the same (the gap was also more noticeable for the base models).
I wonder whether MuAViC processes the LRS3 video differently from AV-HuBERT, which would explain the difference in performance?
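
To put the gap in relative terms, here is a minimal sketch. It assumes the figures above are error rates in percent (lower is better), and `relative_change` is a helper name introduced here for illustration only; it is not part of AV-HuBERT or MuAViC.

```python
def relative_change(baseline: float, other: float) -> float:
    """Relative change of `other` with respect to `baseline`, in percent."""
    return 100.0 * (other - baseline) / baseline


# Scores of the AV-HuBERT large 433h checkpoint, copied from the lists above
# as (AV-HuBERT version of LRS3, MuAViC version of LRS3).
scores = {
    "audio-visual": (1.486, 1.496),
    "audio-only": (1.951, 1.951),
    "video-only": (34.135, 35.995),
}

for modality, (avhubert, muavic) in scores.items():
    # A positive value means the MuAViC version scores worse.
    print(f"{modality}: {relative_change(avhubert, muavic):+.2f}% relative")
```

With these numbers the video-only score is roughly 5% worse in relative terms on the MuAViC version, while the audio-only score does not change at all.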

Hi @roudimit ,

Thank you so much for raising this issue and so sorry for the late reply!

To be honest, I never tested our checkpoints on VSR since it was out of scope! However, looking at the video processing code for MuAViC and AV-HuBERT, I can see a few differences:

  • How frames are extracted from the video: AV-HuBERT does this on the fly, while MuAViC does it beforehand.
  • How the video is saved: both use ffmpeg, but slightly differently.

These are the only differences that I could find! Hope this helps.

Thanks @Anwarvic for the pointers! I tested the video loading and the video saving. The loading functions from MuAViC and AV-HuBERT load the video identically. However, the saving with ffmpeg differs: AV-HuBERT specifies '-crf', '20', while MuAViC uses the default (I believe crf=23), which means the video frames from MuAViC are more compressed. A link with more details: https://stackoverflow.com/questions/64011346/ffmpeg-quality-conversion-options-video-compression
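
To make the difference concrete, here is a minimal sketch of the two kinds of ffmpeg invocation. It is not the actual saving code from either repository: `build_ffmpeg_cmd` and the file names are hypothetical, the sketch assumes the libx264 encoder, and any other flags the two projects pass are omitted. The only point is the presence or absence of `-crf 20`.

```python
import shlex
from typing import List, Optional


def build_ffmpeg_cmd(src: str, dst: str, crf: Optional[int] = None) -> List[str]:
    """Build an ffmpeg re-encoding command (hypothetical helper).

    If `crf` is None, no -crf flag is passed and the encoder default applies
    (23 for libx264). A lower CRF means less compression and higher quality.
    """
    cmd = ["ffmpeg", "-y", "-i", src, "-c:v", "libx264"]
    if crf is not None:
        cmd += ["-crf", str(crf)]
    cmd.append(dst)
    return cmd


# AV-HuBERT-style saving: explicit CRF of 20.
explicit_crf = build_ffmpeg_cmd("in.mp4", "out_crf20.mp4", crf=20)
# MuAViC-style saving: no CRF given, so the encoder default is used.
default_crf = build_ffmpeg_cmd("in.mp4", "out_default.mp4")

print(shlex.join(explicit_crf))
print(shlex.join(default_crf))
```

Passing either list to `subprocess.run(cmd, check=True)` would run the encode, assuming ffmpeg is installed and the input file exists.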

I'm going to leave this issue open so that others are aware of the difference between the video processing.