LaViLa

Code release for "Learning Video Representations from Large Language Models"


Learning Video Representations from Large Language Models
Yue Zhao, Ishan Misra, Philipp Krähenbühl, Rohit Girdhar
CVPR 2023 (Highlight, acceptance rate ≈ 2.5%)
arxiv | bibtex | colab | 🤗 demo | website

LaViLa (Language augmented Video Language Pretraining) is a new approach to learning video representations from Large Language Models (LLMs). We repurpose LLMs as visually conditioned "Narrators" and use them to automatically generate video-language paired data, which we then use to learn a video-language representation that outperforms prior work by large margins.

Sample generations:

| Video | Generation 1 | Generation 2 |
| --- | --- | --- |
| (video) | so now we're going to slice the bread | now i'm going to do is just slice this up into a nice chunk and then we're going to place it on the plate |

Try out our Narrator to generate text descriptions for your own videos! You can also try out a web demo here: Hugging Face Spaces

The resulting video-language model sets a new state-of-the-art on a number of popular video tasks!

Introduction and installation

LaViLa leverages Large Language Models (LLMs) as "NARRATOR"s (and "REPHRASER"s) to densely narrate long videos, and uses these narrations to train strong dual-encoder models.

See INSTALL.md to install this code.

NARRATOR

NARRATOR is a visually conditioned LLM that takes video frames as input and pseudo-labels the clip with narrations.
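Conceptually, the pseudo-labeling step is a loop that samples several candidate narrations per clip from the visually conditioned LLM. A hedged Python sketch (the `pseudo_label` function and `narrator` callable below are illustrative stand-ins, not this repo's actual API):

```python
# Hedged sketch of NARRATOR pseudo-labeling: all names here are
# illustrative, not this repository's real interfaces.
def pseudo_label(video_clips, narrator, num_samples=3):
    """Attach `num_samples` candidate narrations to each clip."""
    labeled = []
    for clip in video_clips:
        # In LaViLa the narrator is a visually conditioned LLM; sampling
        # it several times yields diverse candidate narrations per clip.
        narrations = [narrator(clip) for _ in range(num_samples)]
        labeled.append({"clip": clip, "narrations": narrations})
    return labeled

# Stub narrator, for illustration only.
demo = pseudo_label(
    ["clip_0", "clip_1"],
    narrator=lambda clip: f"C interacts with objects in {clip}",
)
print(len(demo), len(demo[0]["narrations"]))  # prints: 2 3
```

The generated narrations then serve as the text side of video-text pairs for pretraining the dual-encoder described below.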

NARRATOR Demo

We provide some generated samples by our NARRATOR:

| | Narration |
| --- | --- |
| Human | C separates the yarn. C lifts container. C opterates the camera. |
| NARRATOR generation (a) | C stetches the thread with both hands. C wipes the countertop with a sponge. C takes a photo shot. |
| NARRATOR generation (b) | C pulls out the yarn with her right hand. C moves the container. A man X looks at the camera. |

Run the narrator demo using Colab (no GPU needed): Open In Colab
or on the web using 🤗 Spaces: Hugging Face Spaces (thanks to @nateraw!)

Since the free Colab tier offers very limited RAM, if you'd like to run the demo with a larger model, please run ./demo_narrator.py locally. For more technical details, please refer to Sec 4.1 in our paper.

```shell
# CPU mode
python demo_narrator.py [--video-path $TEST_VIDEO]

# GPU mode
python demo_narrator.py --cuda
```

Our NARRATOR also works on third-person videos! Below are several examples generated by our NARRATOR, pre-trained on HowTo100M Auto-Aligned (HTM-AA) and applied to some stock-footage video clips. Note that since the text corpus in HowTo100M consists of ASR transcriptions, the style of the narrations differs slightly from that of the ground-truth captions. However, the generated results are generally reasonable.

| | Video 1 | Video 2 | Video 3 |
| --- | --- | --- | --- |
| GT caption | Pastry chef cutting bread into slices during the preparation of a dessert, inside a kitchen. | Close-up shot of the hands of an experienced baker skillfully kneading bread dough. | Chef preparing a sauce in a blender, adding different ingredients while blending. |
| NARRATOR (a) | so now we're going to slice the bread | i'm gonna make a little hole in the middle of the dough here | all right let's blend this up |
| NARRATOR (b) | now i'm going to do is just slice this up into a nice chunk and then we're going to place it on the plate | you just keep kneading it | the last step to making this is to blend the ingredients in the food processor |

Below is a demo for 3rd-person videos.

```shell
python demo_narrator_3rd_person.py [--video-path $TEST_VIDEO] [--cuda]
```

Dual-Encoder

The dual-encoder model contains a video encoder and a text encoder. It learns a video-language representation from both human annotations and generated narrations using a contrastive loss, as in CLIP.
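The contrastive objective pairs each clip with its narration in a batch and treats all other combinations as negatives, i.e. a symmetric CLIP-style InfoNCE loss. A minimal NumPy sketch for illustration only (the repo's training code is the reference; the function name and temperature value here are assumptions):

```python
import numpy as np

def clip_style_loss(video_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired video/text embeddings.

    video_emb, text_emb: (N, D) arrays; row i of each is a positive pair.
    """
    # L2-normalize so the dot product is cosine similarity
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = v @ t.T / temperature   # (N, N); positives on the diagonal
    labels = np.arange(len(v))

    def cross_entropy(lg):
        lg = lg - lg.max(axis=1, keepdims=True)  # numerical stability
        log_prob = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -log_prob[labels, labels].mean()

    # average of video->text and text->video retrieval losses
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))

rng = np.random.default_rng(0)
v = rng.normal(size=(4, 8))
loss_matched = clip_style_loss(v, v)                       # aligned pairs: low loss
loss_random = clip_style_loss(v, rng.normal(size=(4, 8)))  # unrelated text: higher loss
```

Minimizing this loss pulls each clip's embedding toward its own narration while pushing it away from the other narrations in the batch.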

  • LaViLa's dual-encoder achieves excellent zero-shot performance on a wide range of egocentric benchmarks, outperforming previous state-of-the-art video-language pretraining methods by a large margin.

    | | Backbone | EK-100 MIR avg. mAP^ | EK-100 MIR avg. nDCG^ | Charades-Ego mAP | EGTEA mean acc. | EgoMCQ intra-video acc. |
    | --- | --- | --- | --- | --- | --- | --- |
    | Prev. SOTA^^ | TSF-B | 22.1/23.3 | 22.1/27.9 | 25.2 | 17.6 | 57.2 |
    | LAVILA | TSF-B | 29.7/30.9 | 31.5/32.0 | 26.8 | 28.9 | 59.9 |
    | LAVILA | TSF-L | 35.0/36.1 | 34.2/34.6 | 28.9 | 34.1 | 63.1 |

    ^ The two numbers are obtained by using different numbers of frames as input (4-frame and 16-frame).

    ^^ We use the checkpoints released by EgoVLP and convert them to be compatible with this codebase. Also note that our reproduced numbers are better than the reported ones, especially on EK-100 MIR, since we evaluate on raw videos directly (for more details, check out Appendix F & Table 10 in our paper).

    For details on how to get the numbers, please refer to MODEL_ZOO.md.

  • Once fine-tuned on a down-stream dataset, LaViLa's dual-encoder also achieves state-of-the-art results on it. We show some key results below.

    | | EK-100 MIR avg. mAP | EK-100 MIR avg. nDCG | EK-100 CLS Action top-1 | Charades-Ego mAP | EGTEA mean acc. |
    | --- | --- | --- | --- | --- | --- |
    | Prev. SOTA | 45.0 | 59.4 | 50.5 | 32.1 | 65.9 |
    | LAVILA | 50.9 | 66.5 | 50.9 | 36.1 | 76.0 |

    For details on how to fine-tune the pre-trained dual-encoder on down-stream datasets, please refer to MODEL_ZOO.md.

License

The majority of LAVILA is licensed under the MIT License; however, portions of the project are available under separate license terms.

Citing LaViLa

@inproceedings{zhao2023lavila,
  title={Learning Video Representations from Large Language Models},
  author={Zhao, Yue and Misra, Ishan and Kr{\"a}henb{\"u}hl, Philipp and Girdhar, Rohit},
  booktitle={CVPR},
  year={2023}
}