LaViLa

Code release for "Learning Video Representations from Large Language Models"


Learning Video Representations from Large Language Models
Yue Zhao, Ishan Misra, Philipp Krähenbühl, Rohit Girdhar
CVPR 2023 (Highlight, acceptance rate ≈ 2.5%)
arxiv | bibtex | colab | 🤗 demo | website

LaViLa (Language augmented Video Language Pretraining) is a new approach to learning video representations from Large Language Models (LLMs). We repurpose LLMs as visually conditioned "Narrators" and use them to automatically generate video-language paired data, which we then use to learn a video-language representation that outperforms prior work by large margins.

Sample generations:

| Video | Generation 1 | Generation 2 |
| --- | --- | --- |
| (video) | so now we're going to slice the bread | now i'm going to do is just slice this up into a nice chunk and then we're going to place it on the plate |

Try out our Narrator to generate text descriptions for your own videos! You can also try out a web demo here: Hugging Face Spaces

The resulting video-language model sets a new state-of-the-art on a number of popular video tasks!

Introduction and installation

LaViLa leverages Large Language Models (LLMs) as "NARRATOR"s (and "REPHRASER"s) to densely narrate long videos, and uses these narrations to train strong dual-encoder models.

See INSTALL.md to install this code.

NARRATOR

NARRATOR is a visually conditioned LLM that takes video frames as input and pseudo-labels the clip with narrations.
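Conceptually, the pseudo-labeling step is a loop that samples several candidate narrations per clip from the visually conditioned LLM. A hedged Python sketch (the `pseudo_label` function and `narrator` callable below are illustrative stand-ins, not this repo's actual API):

```python
# Hedged sketch of NARRATOR pseudo-labeling: all names here are
# illustrative, not this repository's real interfaces.
def pseudo_label(video_clips, narrator, num_samples=3):
    """Attach `num_samples` candidate narrations to each clip."""
    labeled = []
    for clip in video_clips:
        # In LaViLa the narrator is a visually conditioned LLM; sampling
        # it several times yields diverse candidate narrations per clip.
        narrations = [narrator(clip) for _ in range(num_samples)]
        labeled.append({"clip": clip, "narrations": narrations})
    return labeled

# Stub narrator, for illustration only.
demo = pseudo_label(
    ["clip_0", "clip_1"],
    narrator=lambda clip: f"C interacts with objects in {clip}",
)
print(len(demo), len(demo[0]["narrations"]))  # prints: 2 3
```

The generated narrations then serve as the text side of video-text pairs for pretraining the dual-encoder described below.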

NARRATOR Demo

We provide some generated samples by our NARRATOR:

| | Narration |
| --- | --- |
| Human | C separates the yarn. C lifts container. C opterates the camera. |
| NARRATOR generation (a) | C stetches the thread with both hands. C wipes the countertop with a sponge. C takes a photo shot. |
| NARRATOR generation (b) | C pulls out the yarn with her right hand. C moves the container. A man X looks at the camera. |

Run the narrator demo using Colab (no GPU needed): Open In Colab
or on the web using 🤗 Spaces: Hugging Face Spaces (thanks to @nateraw!)

Since the free Colab tier offers very limited RAM, if you'd like to run the demo with a larger model, please run ./demo_narrator.py locally. For more technical details, please refer to Sec 4.1 in our paper.

```shell
# CPU mode
python demo_narrator.py [--video-path $TEST_VIDEO]

# GPU mode
python demo_narrator.py --cuda
```

Our NARRATOR also works on third-person videos! Below are several examples generated by our NARRATOR, pre-trained on HowTo100M Auto-Aligned (HTM-AA) and applied to some stock-footage video clips. Note that since the text corpus in HowTo100M consists of ASR transcriptions, the style of the narrations differs slightly from that of the ground-truth captions. However, the generated results are generally reasonable.

| | Video 1 | Video 2 | Video 3 |
| --- | --- | --- | --- |
| GT caption | Pastry chef cutting bread into slices during the preparation of a dessert, inside a kitchen. | Close-up shot of the hands of an experienced baker skillfully kneading bread dough. | Chef preparing a sauce in a blender, adding different ingredients while blending. |
| NARRATOR (a) | so now we're going to slice the bread | i'm gonna make a little hole in the middle of the dough here | all right let's blend this up |
| NARRATOR (b) | now i'm going to do is just slice this up into a nice chunk and then we're going to place it on the plate | you just keep kneading it | the last step to making this is to blend the ingredients in the food processor |

Below is a demo for 3rd-person videos.

```shell
python demo_narrator_3rd_person.py [--video-path $TEST_VIDEO] [--cuda]
```

Dual-Encoder

The dual-encoder model contains a video encoder and a text encoder. It learns a video-language representation from both human annotations and generated narrations using a contrastive loss, as in CLIP.
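The contrastive objective pairs each clip with its narration in a batch and treats all other combinations as negatives, i.e. a symmetric CLIP-style InfoNCE loss. A minimal NumPy sketch for illustration only (the repo's training code is the reference; the function name and temperature value here are assumptions):

```python
import numpy as np

def clip_style_loss(video_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired video/text embeddings.

    video_emb, text_emb: (N, D) arrays; row i of each is a positive pair.
    """
    # L2-normalize so the dot product is cosine similarity
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = v @ t.T / temperature   # (N, N); positives on the diagonal
    labels = np.arange(len(v))

    def cross_entropy(lg):
        lg = lg - lg.max(axis=1, keepdims=True)  # numerical stability
        log_prob = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -log_prob[labels, labels].mean()

    # average of video->text and text->video retrieval losses
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))

rng = np.random.default_rng(0)
v = rng.normal(size=(4, 8))
loss_matched = clip_style_loss(v, v)                       # aligned pairs: low loss
loss_random = clip_style_loss(v, rng.normal(size=(4, 8)))  # unrelated text: higher loss
```

Minimizing this loss pulls each clip's embedding toward its own narration while pushing it away from the other narrations in the batch.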

  • LaViLa's dual-encoder achieves excellent zero-shot performance on a wide range of egocentric benchmarks, outperforming previous state-of-the-art video-language pretraining methods by a large margin.

    | | Backbone | EK-100 MIR avg. mAP^ | EK-100 MIR avg. nDCG^ | Charades-Ego mAP | EGTEA mean acc. | EgoMCQ intra-video acc. |
    | --- | --- | --- | --- | --- | --- | --- |
    | Prev. SOTA^^ | TSF-B | 22.1/23.3 | 22.1/27.9 | 25.2 | 17.6 | 57.2 |
    | LAVILA | TSF-B | 29.7/30.9 | 31.5/32.0 | 26.8 | 28.9 | 59.9 |
    | LAVILA | TSF-L | 35.0/36.1 | 34.2/34.6 | 28.9 | 34.1 | 63.1 |

    ^ The two numbers are obtained by using different numbers of frames as input (4-frame and 16-frame).

    ^^ We use the checkpoints released by EgoVLP and convert them to be compatible with this codebase. Also note that our reproduced numbers are better than the reported ones, especially on EK-100 MIR, since we evaluate on raw videos directly (for more details, check out Appendix F & Table 10 in our paper).

    For details on how to get the numbers, please refer to MODEL_ZOO.md.

  • Once fine-tuned on a down-stream dataset, LaViLa's dual-encoder also achieves state-of-the-art results on it. We show some key results below.

    | | EK-100 MIR avg. mAP | EK-100 MIR avg. nDCG | EK-100 CLS Action top-1 | Charades-Ego mAP | EGTEA mean acc. |
    | --- | --- | --- | --- | --- | --- |
    | Prev. SOTA | 45.0 | 59.4 | 50.5 | 32.1 | 65.9 |
    | LAVILA | 50.9 | 66.5 | 50.9 | 36.1 | 76.0 |

    For details on how to fine-tune the pre-trained dual-encoder on down-stream datasets, please refer to MODEL_ZOO.md.

License

The majority of LAVILA is licensed under the MIT License; however, portions of the project are available under separate license terms.

Citing LaViLa

@inproceedings{zhao2023lavila,
  title={Learning Video Representations from Large Language Models},
  author={Zhao, Yue and Misra, Ishan and Kr{\"a}henb{\"u}hl, Philipp and Girdhar, Rohit},
  booktitle={CVPR},
  year={2023}
}