/minisora

The Mini Sora project aims to explore the implementation path and future development direction of Sora.

Primary language: Python. License: Apache License 2.0 (Apache-2.0).

Mini Sora Community


English | 简体中文

👋 join us on WeChat

The Mini Sora open-source community is a community-driven initiative, organized spontaneously by its members, free of charge and free of any exploitation. The project aims to explore the implementation path and future development direction of Sora.

  • Regular roundtable discussions will be held with the Sora team and the community to explore possibilities.
  • We will delve into existing technological pathways for video generation.

Hot News

Paper Reproduction Group

Project Page

Reproduction Goals

  1. GPU-Friendly: Ideally, the model should have low requirements for GPU memory and GPU count, for example being trainable and usable for inference on 8 A100 80G cards, 8 A6000 48G cards, or RTX4090 24G cards.
  2. Training-Efficiency: It should achieve good results without requiring extensive training time.
  3. Inference-Efficiency: Generated videos do not need to be long or high-resolution; 3-10 seconds in length at 480p resolution is acceptable.
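
To give a rough sense of the scale implied by the inference goal, the following is a back-of-the-envelope sketch, not part of the project. The `ClipSpec` class, the 24 fps frame rate, and the 854x480 frame size are assumptions made for illustration; the goals above state only the 3-10 second length and 480p resolution.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClipSpec:
    """Hypothetical description of one generated clip (illustrative only)."""
    seconds: float
    fps: int = 24      # assumption: the goals do not state a frame rate
    width: int = 854   # assumption: 480p taken as 854x480 (16:9)
    height: int = 480


def num_frames(spec: ClipSpec) -> int:
    """Number of frames in the clip."""
    return round(spec.seconds * spec.fps)


def raw_rgb_bytes(spec: ClipSpec) -> int:
    """Size of the uncompressed 8-bit RGB clip in bytes."""
    return num_frames(spec) * spec.width * spec.height * 3


if __name__ == "__main__":
    # Compare the shortest and longest clip lengths named in the goals.
    for seconds in (3, 10):
        spec = ClipSpec(seconds=seconds)
        mib = raw_rgb_bytes(spec) / 2**20
        print(f"{seconds:>2} s: {num_frames(spec)} frames, {mib:.1f} MiB raw RGB")
```

Under these assumptions a clip spans 72 to 240 frames. The figures describe only the raw pixel volume; the memory actually needed by a model depends on its architecture and latent compression, which this sketch does not attempt to estimate.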

Recent Roundtable Discussions

Sora Night Talk on Video Diffusion Overview

Zhihu Notes: A Survey on Generative Diffusion Model (an overview of generative diffusion models)

Paper Reading Program

Recruitment of Presenters

Related Work

Diffusion Models

Paper Link
1) Guided-Diffusion: Diffusion Models Beat GANs on Image Synthesis NeurIPS 21 Paper, Github
2) Latent Diffusion: High-Resolution Image Synthesis with Latent Diffusion Models CVPR 22 Paper, Github
3) EDM: Elucidating the Design Space of Diffusion-Based Generative Models NeurIPS 22 Paper, Github
4) DDPM: Denoising Diffusion Probabilistic Models NeurIPS 20 Paper, Github
5) DDIM: Denoising Diffusion Implicit Models ICLR 21 Paper, Github
6) Score-Based Diffusion: Score-Based Generative Modeling through Stochastic Differential Equations ICLR 21 Paper, Github, Blog
7) Stable Cascade: Würstchen: An efficient architecture for large-scale text-to-image diffusion models ICLR 24 Paper, Github, Blog
8) Diffusion Models in Vision: A Survey TPAMI 23 Paper, Github

Diffusion Transformer

Paper Link
1) UViT: All are Worth Words: A ViT Backbone for Diffusion Models CVPR 23 Paper, Github, ModelScope
2) DiT: Scalable Diffusion Models with Transformers ICCV 23 Paper, Github, ModelScope
3) SiT: Exploring Flow and Diffusion-based Generative Models with Scalable Interpolant Transformers Paper, Github, ModelScope
4) FiT: Flexible Vision Transformer for Diffusion Model Paper, Github
5) k-diffusion: Scalable High-Resolution Pixel-Space Image Synthesis with Hourglass Diffusion Transformers Paper, Github
6) OpenDiT: An Easy, Fast and Memory-Efficient System for DiT Training and Inference Github
7) Large-DiT: Large Diffusion Transformer Github
8) VisionLLaMA: A Unified LLaMA Interface for Vision Tasks Paper, Github

Video Generation

Paper Link
1) AnimateDiff: Animate Your Personalized Text-to-Image Diffusion Models without Specific Tuning ICLR 24 Paper, Github, ModelScope
2) I2VGen-XL: High-Quality Image-to-Video Synthesis via Cascaded Diffusion Models Paper, Github, ModelScope
3) Imagen Video: High Definition Video Generation with Diffusion Models Paper
4) MoCoGAN: Decomposing Motion and Content for Video Generation CVPR 18 Paper
5) Adversarial Video Generation on Complex Datasets Paper
6) W.A.L.T: Photorealistic Video Generation with Diffusion Models Paper, Project
7) VideoGPT: Video Generation using VQ-VAE and Transformers Paper, Github
8) Video Diffusion Models Paper, Github, Project
9) MCVD: Masked Conditional Video Diffusion for Prediction, Generation, and Interpolation NeurIPS 22 Paper, Github, Project, Blog
10) VideoPoet: A Large Language Model for Zero-Shot Video Generation Paper
11) MAGVIT: Masked Generative Video Transformer CVPR 23 Paper, Github, Project, Colab
12) EMO: Emote Portrait Alive - Generating Expressive Portrait Videos with Audio2Video Diffusion Model under Weak Conditions Paper, Github, Project
13) SimDA: Simple Diffusion Adapter for Efficient Video Generation Paper, Github, Project
14) StableVideo: Text-driven Consistency-aware Diffusion Video Editing ICCV 23 Paper, Github, Project
15) SVD: Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets Paper, Github
16) ADD: Adversarial Diffusion Distillation Paper, Github
17) GenTron: Diffusion Transformers for Image and Video Generation CVPR 24 Paper, Project

Patches Project

Paper Link
1) Interactive Video Stylization Using Few-Shot Patch-Based Training Paper, Github
2) Zoom-VQA: Patches, Frames and Clips Integration for Video Quality Assessment Paper, Github
3) FlexiViT: One Model for All Patch Sizes Paper, Github

Long-context

Paper Link
1) World Model on Million-Length Video And Language With RingAttention Paper, Github
2) Ring Attention with Blockwise Transformers for Near-Infinite Context Paper, Github
3) Extending LLMs' Context Window with 100 Samples Paper, Github
4) Efficient Streaming Language Models with Attention Sinks ICLR 24 Paper, Github
5) The What, Why, and How of Context Length Extension Techniques in Large Language Models – A Detailed Survey Paper
6) MovieChat: From Dense Token to Sparse Memory for Long Video Understanding CVPR 24 Paper, Github, Project

Base Video Generation Models

Paper Link
1) ViViT: A Video Vision Transformer ICCV 21 Paper, Github
2) VideoLDM: Align your Latents: High-Resolution Video Synthesis with Latent Diffusion Models CVPR 23 Paper
3) LVDM: Latent Video Diffusion Models for High-Fidelity Long Video Generation Paper, Github
4) LFDM: Conditional Image-to-Video Generation with Latent Flow Diffusion Models CVPR 23 Paper, Github
5) MotionDirector: Motion Customization of Text-to-Video Diffusion Models Paper, Github
6) TGAN-ODE: Latent Neural Differential Equations for Video Generation Paper, Github
7) Text2Video-Zero: Text-to-Image Diffusion Models are Zero-Shot Video Generators Paper, Github
8) VideoCrafter1: Open Diffusion Models for High-Quality Video Generation Paper, Github
9) VideoCrafter2: Overcoming Data Limitations for High-Quality Video Diffusion Models Paper, Github

Audio Related Resource

Paper Link
1) Stable Audio: Fast Timing-Conditioned Latent Audio Diffusion Link
2) MM-Diffusion: Learning Multi-Modal Diffusion Models for Joint Audio and Video Generation CVPR 23 Paper, Github
3) Pengi: An Audio Language Model for Audio Tasks NeurIPS 23 Paper, Github
4) VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset NeurIPS 23 Paper, Github
5) NaturalSpeech: End-to-End Text to Speech Synthesis with Human-Level Quality Paper, Github
6) NaturalSpeech 2: Latent Diffusion Models are Natural and Zero-Shot Speech and Singing Synthesizers Paper, Github
7) UniAudio: An Audio Foundation Model Toward Universal Audio Generation Paper, Github

Consistency

Paper Link
1) Layered Neural Atlases for Consistent Video Editing TOG 21 Paper, Github, Project
2) StableVideo: Text-driven Consistency-aware Diffusion Video Editing ICCV 23 Paper, Github, Project
3) CoDeF: Content Deformation Fields for Temporally Consistent Video Processing Paper, Github, Project
4) Consistency Models ICML 23 Paper, Github

Prompt Engineering

Paper Link
1) RealCompo: Dynamic Equilibrium between Realism and Compositionality Improves Text-to-Image Diffusion Models Paper, Github, Project
2) Mastering Text-to-Image Diffusion: Recaptioning, Planning, and Generating with Multimodal LLMs Paper, Github
3) LLM-grounded Diffusion: Enhancing Prompt Understanding of Text-to-Image Diffusion Models with Large Language Models TMLR 23 Paper, Github
4) LLM Blueprint: Enabling Text-to-Image Generation with Complex and Detailed Prompts ICLR 24 Paper, Github
5) Progressive Text-to-Image Diffusion with Soft Latent Direction Paper
6) Self-correcting LLM-controlled Diffusion Models CVPR 24 Paper, Github
7) LayoutLLM-T2I: Eliciting Layout Guidance from LLM for Text-to-Image Generation MM 23 Paper
8) LayoutGPT: Compositional Visual Planning and Generation with Large Language Models NeurIPS 23 Paper, Github
9) Gen4Gen: Generative Data Pipeline for Generative Multi-Concept Composition Paper, Github
10) InstructEdit: Improving Automatic Masks for Diffusion-based Image Editing With User Instructions Paper, Github
11) Controllable Text-to-Image Generation with GPT-4 Paper
12) LLM-grounded Video Diffusion Models ICLR 24 Paper
13) VideoDirectorGPT: Consistent Multi-scene Video Generation via LLM-Guided Planning Paper
14) FlowZero: Zero-Shot Text-to-Video Synthesis with LLM-Driven Dynamic Scene Syntax Paper
15) VideoDrafter: Content-Consistent Multi-Scene Video Generation with LLM Paper
16) Free-Bloom: Zero-Shot Text-to-Video Generator with LLM Director and LDM Animator NeurIPS 23 Paper
17) Empowering Dynamics-aware Text-to-Video Diffusion with Large Language Models Paper
18) MotionZero: Exploiting Motion Priors for Zero-shot Text-to-Video Generation Paper
19) GPT4Motion: Scripting Physical Motions in Text-to-Video Generation via Blender-Oriented GPT Planning Paper

Security

Paper Link

World Model

Paper Link
1) NExT-GPT: Any-to-Any Multimodal LLM Paper, Github

Dataset

Dataset Name Link
1) HD-VILA-100M: Advancing High-Resolution Video-Language Representation with Large-Scale Video Transcriptions Paper, Github
2) HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips Paper, Github, Project
3) YT-Temporal-180M: A dataset for learning multimodal script knowledge derived from 6 million public YouTube videos Paper, Github, Project
4) Panda-70M: Captioning 70M Videos with Multiple Cross-Modality Teachers Paper, Github, Project
5) UCF101: Action Recognition Data Set Paper, Project
6) video2dataset: A Simple Tool For Large Video Dataset Curation Tool, Github

Existing high-quality resources

Resources Link
1) Datawhale - AI Video Generation Study Feishu doc
2) A Survey on Generative Diffusion Model TKDE 24 Paper, Github
3) Awesome-Video-Diffusion-Models: A Survey on Video Diffusion Models Paper, Github
4) Awesome-Text-To-Video: A Survey on Text-to-Video Generation/Synthesis Github
5) video-generation-survey: A reading list of video generation Github
6) Awesome-Video-Diffusion Github
7) Video Generation Task in Papers With Code Task
8) Sora: A Review on Background, Technology, Limitations, and Opportunities of Large Vision Models Paper, Github
9) Open-Sora-Plan (PKU-YuanGroup) Github
10) State of the Art on Diffusion Models for Visual Computing Paper
11) Diffusion Models: A Comprehensive Survey of Methods and Applications CSUR 24 Paper, Github
12) Generate Impressive Videos with Text Instructions: A Review of OpenAI Sora, Stable Diffusion, Lumiere and Comparable Models Paper
13) On the Design Fundamentals of Diffusion Models: A Survey Paper
14) Efficient Diffusion Models for Vision: A Survey Paper
15) Text-to-Image Diffusion Models in Generative AI: A Survey Paper
16) Awesome-Diffusion-Transformers GitHub, Page

Mini Sora WeChat Community Exchange Group

 

How to Contribute to the Mini Sora Community

We greatly appreciate your contributions to the Mini Sora open-source community and your help in making it even better!

For more details, please refer to the Contribution Guidelines.

Community contributors