This project develops an efficient autoregressive model for the joint generation of images and videos. The aim is to enhance the generative capabilities of current multimodal large language models (MLLMs), ultimately building an interactive world model that integrates multimodal understanding and generation.
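As a rough illustration of the core idea, the sketch below shows a minimal autoregressive decoding loop over a sequence of discrete visual tokens. The toy model, vocabulary size, and special tokens here are illustrative assumptions, not this project's actual architecture.

```python
import torch
import torch.nn as nn

VOCAB_SIZE = 8192  # assumed size of the visual token codebook
MAX_LEN = 256      # assumed maximum sequence length
BOS = 0            # assumed beginning-of-sequence token id


class TinyARModel(nn.Module):
    """A toy decoder-only transformer standing in for the real model."""

    def __init__(self, vocab=VOCAB_SIZE, dim=256, heads=4, layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.pos = nn.Embedding(MAX_LEN, dim)
        block = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(block, layers)
        self.head = nn.Linear(dim, vocab)

    def forward(self, tokens):
        positions = torch.arange(tokens.size(1), device=tokens.device)
        x = self.embed(tokens) + self.pos(positions)
        causal = nn.Transformer.generate_square_subsequent_mask(tokens.size(1))
        return self.head(self.blocks(x, mask=causal))


@torch.no_grad()
def generate(model, n_tokens, temperature=1.0):
    """Sample visual tokens one at a time, each conditioned on all previous ones."""
    tokens = torch.full((1, 1), BOS, dtype=torch.long)
    for _ in range(n_tokens):
        logits = model(tokens)[:, -1] / temperature   # next-token distribution
        next_tok = torch.multinomial(logits.softmax(-1), num_samples=1)
        tokens = torch.cat([tokens, next_tok], dim=1)
    return tokens[:, 1:]  # drop BOS; a tokenizer decoder maps these to pixels


print(generate(TinyARModel(), n_tokens=16).shape)  # torch.Size([1, 16])
```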
- Train a first version of the model on a relatively small dataset
- Implement the training and testing pipeline
- Test different autoregressive (AR) generation schemes
- Jointly train the image and video generation model on a larger dataset at higher resolutions
- Implement an efficient unstructured image-video tokenizer (see the tokenizer sketch after this list)
- Integrate generation capabilities into MLLMs
- Support multiple conditional generation tasks: image animation, image/video inpainting and outpainting, video prediction, and video interpolation (see the conditioning sketch after this list)
- Support multimodal controllable generation
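For the tokenizer item above, the following is a minimal sketch of a VQ-style image-video tokenizer interface: a 3D-convolutional encoder maps a clip to a grid of latents, each latent is snapped to its nearest codebook entry, and a transposed-convolution decoder maps code ids back to pixels. All shapes, codebook sizes, and downsampling factors are illustrative assumptions.

```python
import torch
import torch.nn as nn


class VideoTokenizer(nn.Module):
    def __init__(self, codebook_size=8192, dim=64):
        super().__init__()
        # Encoder downsamples space by 4x and time by 2x (assumed factors).
        self.encoder = nn.Conv3d(3, dim, kernel_size=(2, 4, 4), stride=(2, 4, 4))
        self.codebook = nn.Embedding(codebook_size, dim)
        self.decoder = nn.ConvTranspose3d(dim, 3, kernel_size=(2, 4, 4), stride=(2, 4, 4))

    def encode(self, video):
        """video: (B, 3, T, H, W) -> integer token grid (B, T', H', W')."""
        z = self.encoder(video)                        # (B, D, T', H', W')
        z = z.permute(0, 2, 3, 4, 1)                   # channels last
        dists = torch.cdist(z.flatten(0, 3), self.codebook.weight)
        return dists.argmin(-1).view(z.shape[:-1])     # nearest code ids

    def decode(self, ids):
        """Map token ids back to pixels through the codebook and decoder."""
        z = self.codebook(ids).permute(0, 4, 1, 2, 3)  # (B, D, T', H', W')
        return self.decoder(z)


tok = VideoTokenizer()
clip = torch.randn(1, 3, 8, 64, 64)  # an 8-frame 64x64 clip
ids = tok.encode(clip)               # (1, 4, 16, 16) discrete tokens
recon = tok.decode(ids)              # (1, 3, 8, 64, 64)
print(ids.shape, recon.shape)
```

And for the conditional generation tasks, one common autoregressive recipe is prefix conditioning: tokens of the observed content (a single frame for image animation, several frames for video prediction) are fed as a prefix and the model samples the rest. The sketch below reuses the toy names from the earlier examples and is an assumption, not this project's API.

```python
import torch


@torch.no_grad()
def generate_conditioned(model, prefix_tokens, n_new, temperature=1.0):
    """Continue an AR model from condition tokens (e.g. observed frames)."""
    tokens = prefix_tokens.clone()
    for _ in range(n_new):
        logits = model(tokens)[:, -1] / temperature
        next_tok = torch.multinomial(logits.softmax(-1), num_samples=1)
        tokens = torch.cat([tokens, next_tok], dim=1)
    return tokens[:, prefix_tokens.size(1):]  # only the newly predicted tokens


# Video prediction: tokens of the observed frames become the condition.
# observed = tok.encode(first_frames).flatten(1)   # using the tokenizer sketch
# future = generate_conditioned(TinyARModel(), observed, n_new=256)
```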
"time lapse of a cloudy sky" | "countryside top view" | "a blue and cloudy sky" | "aerial view of brown dry landscape" |
"waterfalls in between mountain" | "view of the amazon river" | "a river waterfall cascading down the plunge basin" | "flooded landscape with palm trees" |
"drone shot of an abandoned coliseum on a snowy mountain top" | "clouds over mountain" | "aerial view of road in forest" | "a peaceful lake" |
Coming soon