video-pretrained-transformer

Multi-model video-to-text by combining embeddings from Flan-T5 + CLIP + Whisper + SceneGraph. The 'backbone LLM' is pre-trained from scratch on YouTube (YT-1B dataset).


πŸ“ƒ Intuition

I'm building my own multimedia GPT: a competitor to Merlot Reserve & Vid2Seq. It's pre-trained from scratch on YouTube data, mostly the YT-1B dataset of 20M curated YouTube videos containing significant spoken language (English only).

πŸ“œ Arxiv: https://arxiv.org/abs/2304.10505

πŸ‘‰ Project highlights & intuition with photos, check it out: https://twitter.com/KastanDay/status/1595991960380411905

VPT Architecture Diagram (non-3D version)

My design follows the "Embedding + Trunk + Head" pattern I first noticed succeeding in DETR and AlphaFold2. Now, in early 2023, it's also succeeding in PaLM-E and Vid2Seq from Google, Prismer from Nvidia, and many more listed in my Twitter announcement.
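
For intuition, here is a minimal sketch of that pattern as it applies to this project. It is illustrative only: the module names, feature dimensions, and linear projections are assumptions, not the exact VPT training code. Per-modality encoders produce embeddings, small projections map them into the trunk's hidden size, and the T5 trunk plus its LM head generate text.

```python
# Minimal sketch of the "Embedding + Trunk + Head" pattern (illustrative only:
# names, dimensions, and projections are assumptions, not the exact VPT code).
import torch
import torch.nn as nn
from transformers import T5ForConditionalGeneration

class VPTSketch(nn.Module):
    def __init__(self, clip_dim=768, whisper_dim=512, trunk_name="google/flan-t5-base"):
        super().__init__()
        # Trunk + Head: Flan-T5 encoder-decoder with its language-modeling head.
        self.trunk = T5ForConditionalGeneration.from_pretrained(trunk_name)
        d_model = self.trunk.config.d_model
        # Embedding stage: project each modality's features into the trunk's width.
        self.clip_proj = nn.Linear(clip_dim, d_model)
        self.whisper_proj = nn.Linear(whisper_dim, d_model)

    def forward(self, clip_feats, whisper_feats, labels):
        # clip_feats: (batch, n_frame_tokens, clip_dim)
        # whisper_feats: (batch, n_audio_tokens, whisper_dim)
        fused = torch.cat([self.clip_proj(clip_feats),
                           self.whisper_proj(whisper_feats)], dim=1)
        # The fused sequence replaces token embeddings via `inputs_embeds`.
        return self.trunk(inputs_embeds=fused, labels=labels)
```

One appeal of the pattern is that the per-modality encoders can stay frozen while only the projections and trunk are trained.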

πŸš€ Quickstart

  1. Install Git LFS
# Install `git-lfs` (via brew or conda)
brew install git-lfs
-OR-
conda install -c conda-forge -y git-lfs

Then initialize Git LFS:

git-lfs install
  2. Install ffmpeg

A simple install should work fine, despite how convoluted the library tends to be.

# preferred
sudo apt update && sudo apt install ffmpeg
-OR-
# conda method is not as well tested for this project
conda install -c conda-forge -y ffmpeg
# An update command might be necessary to get all of ffmpeg's codec-specific extensions, which we need.
# Solves this error in parallel_whisper.py: ❌❌Error during whisper: Expecting value: line 1 column 1 (char 0)
conda update ffmpeg
  3. Clone the repo with our custom submodules
git clone --recurse-submodules git@github.com:KastanDay/video-pretrained-transformer.git
  4. Install pip requirements
pip install -r ./requirements.txt

Later, if updates are made to submodules, you can pull new changes using:

git submodule update --remote

We use submodules because we needed to modify the internal logic of three libraries used in preprocessing: Lhotse (to make it faster), OpenPSG, and Transformers (to modify the T5 implementation to support modality encodings).
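
As a rough illustration of what a modality encoding can look like (hypothetical code, not the actual patched Transformers fork): a learned per-modality embedding is added to the fused input sequence before it reaches the T5 encoder, analogous to segment embeddings.

```python
# Hypothetical illustration of a modality encoding: a learned embedding per
# modality id (e.g. 0 = text, 1 = video frame, 2 = audio) added to the fused
# inputs before the T5 encoder. Not the actual patched Transformers code.
import torch
import torch.nn as nn

class ModalityEncoding(nn.Module):
    def __init__(self, num_modalities=3, d_model=768):
        super().__init__()
        self.embed = nn.Embedding(num_modalities, d_model)

    def forward(self, inputs_embeds, modality_ids):
        # inputs_embeds: (batch, seq_len, d_model); modality_ids: (batch, seq_len)
        return inputs_embeds + self.embed(modality_ids)

# Usage: tag each position with its modality before handing the sum to the trunk.
enc = ModalityEncoding(d_model=768)
fused = torch.randn(1, 10, 768)            # e.g. 6 CLIP tokens + 4 Whisper tokens
ids = torch.tensor([[1] * 6 + [2] * 4])
fused = enc(fused, ids)
```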

Installation is complete!

Progress

  1. (Oct 2022) Start of project.
  2. (Dec 2022) MVP completed, but messed up the evaluation.
  3. (Dec 2022) Migrated all data to the Deeplake database library; overall much cleaner & more reliable for distributed database updates.
  4. (Jan 2023) Migrated all training logic to Composer, by MosaicML. A super cool library for efficient LLM training, even of Hugging Face models.
  5. (Jan 2023) Finished scaling up distributed pre-processing (i.e. inference w/ Whisper, Flan-T5, OpenPSG, and CLIP). Rock-solid distributed Deeplake dataset.append() operations on any size of SLURM cluster.
  6. (Feb 2023) Tested different backbones: T5 vs. T5 v1.1 vs. Flan-T5. Somehow, v1.1 was terrible and Flan-T5 was by far the best, as suggested by another fine-tuning study; the author confirmed this in response to my follow-up question.
  7. (Mar 2023) WIP: TVQA evaluation. Need to fit more video frames into our 1024-token context window, probably by using fewer final hidden states from CLIP (see the sketch below).
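
To illustrate the idea in item 7, here is a sketch under assumed model names and shapes (not the project's evaluation code): keeping only a pooled vector, or a handful of hidden states, per frame frees up room for more frames within the 1024-token context.

```python
# Sketch of fitting more frames into a fixed context window by keeping fewer
# CLIP hidden states per frame (here: only the first `tokens_per_frame` states,
# i.e. just the CLS token when tokens_per_frame=1). Names/shapes are assumptions.
import torch
from transformers import CLIPImageProcessor, CLIPVisionModel

processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-base-patch32")
vision = CLIPVisionModel.from_pretrained("openai/clip-vit-base-patch32")

def encode_frames(frames, tokens_per_frame=1):
    """frames: list of PIL images. Returns (n_frames * tokens_per_frame, hidden_dim)."""
    inputs = processor(images=frames, return_tensors="pt")
    with torch.no_grad():
        out = vision(**inputs)
    # last_hidden_state: (n_frames, 50, hidden) for ViT-B/32 (1 CLS + 49 patches).
    # Keeping fewer states per frame trades spatial detail for room to pack more
    # frames into the trunk's 1024-token context.
    return out.last_hidden_state[:, :tokens_per_frame, :].flatten(0, 1)
```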

Up next:

  • Find a better scene-graph implementation: only 55 classes from COCO are not enough for YouTube data. Ours relies on Detectron2 as a base, which is great for in-domain objects but not general. I think the best we can do is to use the 1k classes from ImageNet.
  • Totally reimplement the sound/audio model to move away from Whisper -- I think Google's AudioSet, with 600+ classes based on YouTube data, will enable the best models. Here's my favorite from that competition.