This project develops an efficient autoregressive model for the joint generation of images and videos. The aim is to enhance the generative capabilities of current multimodal large language models (MLLMs), ultimately building an interactive world model that integrates multimodal understanding and generation.
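As a rough illustration of the core idea, the sketch below shows a minimal autoregressive decoding loop over a sequence of discrete visual tokens. The toy model, vocabulary size, and special tokens here are illustrative assumptions, not this project's actual architecture.

```python
import torch
import torch.nn as nn

VOCAB_SIZE = 8192  # assumed size of the visual token codebook
MAX_LEN = 256      # assumed maximum sequence length
BOS = 0            # assumed beginning-of-sequence token id


class TinyARModel(nn.Module):
    """A toy decoder-only transformer standing in for the real model."""

    def __init__(self, vocab=VOCAB_SIZE, dim=256, heads=4, layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.pos = nn.Embedding(MAX_LEN, dim)
        block = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(block, layers)
        self.head = nn.Linear(dim, vocab)

    def forward(self, tokens):
        positions = torch.arange(tokens.size(1), device=tokens.device)
        x = self.embed(tokens) + self.pos(positions)
        causal = nn.Transformer.generate_square_subsequent_mask(tokens.size(1))
        return self.head(self.blocks(x, mask=causal))


@torch.no_grad()
def generate(model, n_tokens, temperature=1.0):
    """Sample visual tokens one at a time, each conditioned on all previous ones."""
    tokens = torch.full((1, 1), BOS, dtype=torch.long)
    for _ in range(n_tokens):
        logits = model(tokens)[:, -1] / temperature   # next-token distribution
        next_tok = torch.multinomial(logits.softmax(-1), num_samples=1)
        tokens = torch.cat([tokens, next_tok], dim=1)
    return tokens[:, 1:]  # drop BOS; a tokenizer decoder maps these to pixels


print(generate(TinyARModel(), n_tokens=16).shape)  # torch.Size([1, 16])
```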
- Train a first version of the model on a relatively small dataset
- Implement the training and testing pipeline
- Test different autoregressive (AR) generation schemes
- Jointly train the image and video generation model on a larger dataset at higher resolutions
- Implement an efficient unstructured image-video tokenizer (see the tokenizer sketch after this list)
- Integrate generation capabilities into MLLMs
- Support multiple conditional generation tasks: image animation, image/video inpainting and outpainting, video prediction, and video interpolation (see the conditioning sketch after this list)
- Support multimodal controllable generation
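For the tokenizer item above, the following is a minimal sketch of a VQ-style image-video tokenizer interface: a 3D-convolutional encoder maps a clip to a grid of latents, each latent is snapped to its nearest codebook entry, and a transposed-convolution decoder maps code ids back to pixels. All shapes, codebook sizes, and downsampling factors are illustrative assumptions.

```python
import torch
import torch.nn as nn


class VideoTokenizer(nn.Module):
    def __init__(self, codebook_size=8192, dim=64):
        super().__init__()
        # Encoder downsamples space by 4x and time by 2x (assumed factors).
        self.encoder = nn.Conv3d(3, dim, kernel_size=(2, 4, 4), stride=(2, 4, 4))
        self.codebook = nn.Embedding(codebook_size, dim)
        self.decoder = nn.ConvTranspose3d(dim, 3, kernel_size=(2, 4, 4), stride=(2, 4, 4))

    def encode(self, video):
        """video: (B, 3, T, H, W) -> integer token grid (B, T', H', W')."""
        z = self.encoder(video)                        # (B, D, T', H', W')
        z = z.permute(0, 2, 3, 4, 1)                   # channels last
        dists = torch.cdist(z.flatten(0, 3), self.codebook.weight)
        return dists.argmin(-1).view(z.shape[:-1])     # nearest code ids

    def decode(self, ids):
        """Map token ids back to pixels through the codebook and decoder."""
        z = self.codebook(ids).permute(0, 4, 1, 2, 3)  # (B, D, T', H', W')
        return self.decoder(z)


tok = VideoTokenizer()
clip = torch.randn(1, 3, 8, 64, 64)  # an 8-frame 64x64 clip
ids = tok.encode(clip)               # (1, 4, 16, 16) discrete tokens
recon = tok.decode(ids)              # (1, 3, 8, 64, 64)
print(ids.shape, recon.shape)
```

And for the conditional generation tasks, one common autoregressive recipe is prefix conditioning: tokens of the observed content (a single frame for image animation, several frames for video prediction) are fed as a prefix and the model samples the rest. The sketch below reuses the toy names from the earlier examples and is an assumption, not this project's API.

```python
import torch


@torch.no_grad()
def generate_conditioned(model, prefix_tokens, n_new, temperature=1.0):
    """Continue an AR model from condition tokens (e.g. observed frames)."""
    tokens = prefix_tokens.clone()
    for _ in range(n_new):
        logits = model(tokens)[:, -1] / temperature
        next_tok = torch.multinomial(logits.softmax(-1), num_samples=1)
        tokens = torch.cat([tokens, next_tok], dim=1)
    return tokens[:, prefix_tokens.size(1):]  # only the newly predicted tokens


# Video prediction: tokens of the observed frames become the condition.
# observed = tok.encode(first_frames).flatten(1)   # using the tokenizer sketch
# future = generate_conditioned(TinyARModel(), observed, n_new=256)
```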
"time lapse of a cloudy sky" | "countryside top view" | "a blue and cloudy sky" | "aerial view of brown dry landscape" |
"waterfalls in between mountain" | "view of the amazon river" | "a river waterfall cascading down the plunge basin" | "flooded landscape with palm trees" |
"drone shot of an abandoned coliseum on a snowy mountain top" | "clouds over mountain" | "aerial view of road in forest" | "a peaceful lake" |
Coming soon