mbzuai-oryx/Video-LLaVA

Using ASR captions instead of a heavy audio encoder can be more efficient

lucasjinreal opened this issue · 0 comments

Audio carries a lot of redundant information compared with images.
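
A rough sketch of what the proposal could look like in practice: run any ASR model offline, then pass its timestamped captions to the model as plain text rather than encoding the raw waveform. Everything named here (`AsrSegment`, `format_timestamp`, `build_prompt`, and the prompt layout) is hypothetical and only for illustration. It is not part of this repository, and the ASR step itself is assumed to have already produced the segments.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class AsrSegment:
    """One caption segment from an ASR system (hypothetical structure)."""
    start: float  # segment start time, in seconds
    end: float    # segment end time, in seconds
    text: str     # transcribed speech


def format_timestamp(seconds: float) -> str:
    """Render seconds as MM:SS, truncating fractional seconds."""
    total = int(seconds)
    return f"{total // 60:02d}:{total % 60:02d}"


def build_prompt(segments: List[AsrSegment], question: str) -> str:
    """Turn ASR segments into a text block that can accompany the video frames.

    The layout is an assumption for illustration, not a format the model
    is known to expect.
    """
    lines = [
        f"[{format_timestamp(s.start)}-{format_timestamp(s.end)}] {s.text.strip()}"
        for s in segments
        if s.text.strip()  # skip empty or whitespace-only segments
    ]
    transcript = "\n".join(lines) if lines else "(no speech detected)"
    return f"Audio transcript:\n{transcript}\n\nQuestion: {question}"


if __name__ == "__main__":
    demo = [
        AsrSegment(0.0, 2.5, " Hello there. "),
        AsrSegment(65.0, 70.2, "Let's begin."),
    ]
    print(build_prompt(demo, "What is the speaker doing?"))
```

The point of the sketch is that the audio track is reduced to a few short text lines before it reaches the model, so no audio encoder has to run at inference time. The trade-off is that non-speech sounds are lost.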