KastanDay/video-pretrained-transformer
Multi-modal video-to-text by combining embeddings from Flan-T5 + CLIP + Whisper + SceneGraph. The 'backbone LLM' is pre-trained from scratch on YouTube (YT-1B dataset).
Jupyter Notebook · MIT license
Issues
- 1
Pretrain checkpoints
#1 opened by vishalgv263