Learning Video Representations from Large Language Models
Yue Zhao, Ishan Misra, Philipp Krähenbühl, Rohit Girdhar
arxiv | bibtex | colab | 🤗 demo | website
LaViLa (Language augmented Video Language Pretraining) is a new approach to learning video representations from Large Language Models (LLMs). We repurpose LLMs to be visually conditioned "Narrators", and use them to automatically generate video-language paired data. We then use this data to learn a video-language representation, outperforming prior work by large margins.
Sample Generations:
| Video | Generation 1 | Generation 2 |
| --- | --- | --- |
| *(video: cutting a loaf)* | so now we're going to slice the bread | now i'm going to do is just slice this up into a nice chunk and then we're going to place it on the plate |
Try out our Narrator to generate text descriptions for your own videos! You can also try out a web demo here:
The resulting video-language model sets a new state-of-the-art on a number of popular video tasks!
Introduction and installation
LaViLa leverages Large Language Models (LLMs) as "NARRATOR"s (and "REPHRASER"s) to densely narrate long videos, and uses these narrations to train strong dual-encoder models.
See INSTALL.md to install this code.
NARRATOR
NARRATOR is a visually conditioned LLM that takes video frames as input and pseudo-labels the clip with narrations.
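For intuition, here is a minimal, hedged sketch of the idea: project per-frame visual features into a language model's embedding space and let the LM continue from that visual prefix. Note that the actual NARRATOR conditions a frozen LLM on video features via added attention modules (see Sec. 4.1 of the paper); this toy uses a simpler prefix scheme with an untrained projection, so the feature shapes and projection layer below are illustrative assumptions, not LaViLa's architecture.

```python
# Toy prefix-conditioning sketch (NOT LaViLa's exact architecture).
# Assumes PyTorch and a recent Hugging Face `transformers` release that
# supports `generate(inputs_embeds=...)` for decoder-only models.
import torch
import torch.nn as nn
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModelForCausalLM.from_pretrained("gpt2").eval()

# Stand-in for per-frame features from a video encoder (e.g., 4 frames).
num_frames, feat_dim = 4, 768
frame_feats = torch.randn(1, num_frames, feat_dim)

# Hypothetical, untrained projection from visual features to LM embeddings.
visual_to_prefix = nn.Linear(feat_dim, lm.config.n_embd)
prefix = visual_to_prefix(frame_feats)  # (1, num_frames, n_embd)

with torch.no_grad():
    out = lm.generate(
        inputs_embeds=prefix,
        attention_mask=torch.ones(prefix.shape[:2], dtype=torch.long),
        max_new_tokens=20,
        do_sample=True,  # sampling yields diverse candidate narrations
        top_p=0.95,
    )
print(tokenizer.decode(out[0], skip_special_tokens=True))
```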
NARRATOR Demo
We provide some samples generated by our NARRATOR:
Run the narrator demo using Colab (no GPU needed):
or on the web using 🤗 Spaces: (thanks to @nateraw!)
Since a free Colab account offers only limited RAM, please run ./demo_narrator.py locally if you'd like to run the demo with a larger model. For more technical details, please refer to Sec. 4.1 in our paper.
```bash
# CPU mode
python demo_narrator.py [--video-path $TEST_VIDEO]

# GPU mode
python demo_narrator.py --cuda
```
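Under the hood, the demo decodes the clip and uniformly samples a handful of frames before passing them to the visual encoder. Below is a minimal sketch of that kind of preprocessing, assuming `decord` for video decoding; the file path is a placeholder, and the exact resize/normalization used in demo_narrator.py may differ.

```python
import numpy as np
import torch
import decord

vr = decord.VideoReader("examples/clip.mp4")  # placeholder path
idx = np.linspace(0, len(vr) - 1, num=4).round().astype(int)
frames = vr.get_batch(idx).asnumpy()  # (4, H, W, 3) uint8, RGB
# To a float tensor in (T, C, H, W), scaled to [0, 1]:
frames = torch.from_numpy(frames).permute(0, 3, 1, 2).float() / 255.0
```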
Our narrator also works on third-person videos! Below are several examples generated by a NARRATOR pre-trained on HowTo100M Auto-Aligned (HTM-AA) and applied to some stock-footage video clips. Note that since the text corpus in HowTo100M consists of ASR transcriptions, the narration style is slightly different from that of ground-truth captions. However, the generated results are generally reasonable.
Below is a demo for third-person videos.

```bash
python demo_narrator_3rd_person.py [--video-path $TEST_VIDEO] [--cuda]
```
Dual-Encoder
The dual-encoder model contains a video encoder and a text encoder. It learns a video-language representation from both human annotations and generated narrations using a contrastive loss like CLIP's, as sketched below.
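For reference, here is a minimal sketch of a symmetric CLIP-style contrastive (InfoNCE) objective over a batch of paired video/text embeddings. The fixed temperature stands in for CLIP's learnable logit scale; LaViLa's exact training recipe is in the paper and codebase.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(video_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of matched (video, text) pairs."""
    video_emb = F.normalize(video_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = video_emb @ text_emb.t() / temperature  # (B, B) similarities
    targets = torch.arange(len(logits), device=logits.device)
    loss_v2t = F.cross_entropy(logits, targets)      # video -> text
    loss_t2v = F.cross_entropy(logits.t(), targets)  # text -> video
    return (loss_v2t + loss_t2v) / 2

# Toy usage with random embeddings:
loss = clip_contrastive_loss(torch.randn(8, 256), torch.randn(8, 256))
```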
- LaViLa's dual-encoder achieves excellent zero-shot performance on a wide range of egocentric benchmarks, outperforming previous state-of-the-art video-language pretraining methods by a large margin.
|  | Backbone | EK-100 MIR avg. mAP^ | EK-100 MIR avg. nDCG^ | Charades-Ego mAP | EGTEA mean acc. | EgoMCQ intra-video acc. |
| --- | --- | --- | --- | --- | --- | --- |
| Prev. SOTA^^ | TSF-B | 22.1/23.3 | 22.1/27.9 | 25.2 | 17.6 | 57.2 |
| LAVILA | TSF-B | 29.7/30.9 | 31.5/32.0 | 26.8 | 28.9 | 59.9 |
| LAVILA | TSF-L | 35.0/36.1 | 34.2/34.6 | 28.9 | 34.1 | 63.1 |

^ The two numbers are obtained by using different numbers of frames as input (4-frame and 16-frame).
^^ We use the checkpoints released by EgoVLP and convert them to be compatible with this codebase. Also note that our reproduced numbers are better than the reported numbers, especially on EK-100 MIR since we evaluate on raw videos directly (for more details, check out Appendix F & Table 10 in our paper).
For details on how to get the numbers, please refer to MODEL_ZOO.md.
- Once fine-tuned on a downstream dataset, LaViLa's dual-encoder also achieves state-of-the-art results on it. We show some key results below, followed by a toy sketch of how retrieval scores are computed from the embeddings.
|  | EK-100 MIR avg. mAP | EK-100 MIR avg. nDCG | EK-100 CLS Action top-1 | Charades-Ego mAP | EGTEA mean acc. |
| --- | --- | --- | --- | --- | --- |
| Prev. SOTA | 45.0 | 59.4 | 50.5 | 32.1 | 65.9 |
| LAVILA | 50.9 | 66.5 | 50.9 | 36.1 | 76.0 |

For details on how to fine-tune the pre-trained dual-encoder on downstream datasets, please refer to MODEL_ZOO.md.
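To make the retrieval metrics above concrete, here is a hedged toy of how text-to-video retrieval is typically scored with a dual encoder: embed queries and clips, L2-normalize, and rank by cosine similarity. The random embeddings stand in for LaViLa's encoder outputs; the actual mAP/nDCG protocols for EK-100 MIR are implemented in the codebase.

```python
import torch
import torch.nn.functional as F

# Random stand-ins for encoder outputs: 100 queries, 100 candidate clips.
text_embs = F.normalize(torch.randn(100, 256), dim=-1)
video_embs = F.normalize(torch.randn(100, 256), dim=-1)

sim = text_embs @ video_embs.t()              # (queries, clips) cosine similarity
ranks = sim.argsort(dim=-1, descending=True)  # best-matching clips first
recall_at_1 = (ranks[:, 0] == torch.arange(100)).float().mean().item()
print(f"R@1: {recall_at_1:.3f}")              # ~0.01 for random embeddings
```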
License
The majority of LAVILA is licensed under the MIT License; however, portions of the project are available under separate license terms:
- https://github.com/EGO4D/episodic-memory is licensed under the MIT license.
- The videos of cutting a loaf, kneading a dough, and preparing a sauce in a blender are licensed under the Mixkit Stock Video Free License.
Citing LaViLa
```bibtex
@article{zhao2022lavila,
  title={Learning Video Representations from Large Language Models},
  author={Zhao, Yue and Misra, Ishan and Kr{\"a}henb{\"u}hl, Philipp and Girdhar, Rohit},
  journal={arXiv preprint arXiv:2212.04501},
  year={2022}
}
```