Mini Sora Community

👋 join us on WeChat

The Mini Sora open-source community is positioned as a community-driven initiative (free of charge and devoid of any exploitation) organized spontaneously by community members. The Mini Sora project aims to explore the implementation path and future development direction of Sora.

Regular roundtable discussions will be held with the Sora team and the community to explore possibilities.
We will delve into existing technological pathways for video generation.

Reproduction Group of MiniSora Community

MiniSora Reproduction Group Page

Sora Reproduction Goals of MiniSora

GPU-Friendly: Ideally, it should have low requirements for GPU memory size and the number of GPUs, such as being trainable and inferable with compute power like 8 A100 80G cards, 8 A6000 48G cards, or RTX4090 24G.
Training-Efficiency: It should achieve good results without requiring extensive training time.
Inference-Efficiency: When generating videos during inference, there is no need for high length or resolution; acceptable parameters include 3-10 seconds in length and 480p resolution.

MiniSora-DiT: Reproducing the DiT Paper with XTuner

MiniSora-DiT Group Page: https://github.com/mini-sora/minisora-DiT

Requirements

We are recruiting MiniSora Community contributors to reproduce DiT using XTuner.

We hope the community member has the following characteristics:

Familiarity with the OpenMMLab MMEngine mechanism.
Familiarity with DiT.

Background

The author of DiT is the same as the author of Sora.
XTuner has the core technology to efficiently train sequences of length 1000K.

Support

Computational resources: 2*A100.
Strong supports from XTuner core developer P佬@pppppM.

Hot News

Stable Diffusion 3: MM-DiT: Scaling Rectified Flow Transformers for High-Resolution Image Synthesis

Recent Roundtable Discussions

Paper Interpretation of Stable Diffusion 3 paper: MM-DiT

Speaker: MMagic Core Contributors

Live Streaming Time: 03/12 20:00

Highlights: MMagic core contributors will lead us in interpreting the Stable Diffusion 3 paper, discussing the architecture details and design principles of Stable Diffusion 3.

Please scan the QR code with WeChat to book a live video session.

Highlights from Previous Discussions

Night Talk with Sora: Video Diffusion Overview

ZhiHu Notes: A Survey on Generative Diffusion Model: An Overview of Generative Diffusion Models

Paper Reading Program

Sora: Creating video from text
Technical Report: Video generation models as world simulators
Latte: Latte: Latent Diffusion Transformer for Video Generation
DiT: Scalable Diffusion Models with Transformers
Stable Cascade (ICLR 24 Paper): Würstchen: An efficient architecture for large-scale text-to-image diffusion models
Updating...

Recruitment of Presenters

Related Work

Diffusion Models
Paper	Link
1) Guided-Diffusion: Diffusion Models Beat GANs on Image Synthesis	NeurIPS 21 Paper, GitHub
2) Latent Diffusion: High-Resolution Image Synthesis with Latent Diffusion Models	CVPR 22 Paper, GitHub
3) EDM: Elucidating the Design Space of Diffusion-Based Generative Models	NeurIPS 22 Paper, GitHub
4) DDPM: Denoising Diffusion Probabilistic Models	NeurIPS 20 Paper, GitHub
5) DDIM: Denoising Diffusion Implicit Models	ICLR 21 Paper, GitHub
6) Score-Based Diffusion: Score-Based Generative Modeling through Stochastic Differential Equations	ICLR 21 Paper, GitHub, Blog
7) Stable Cascade: Würstchen: An efficient architecture for large-scale text-to-image diffusion models	ICLR 24 Paper, GitHub, Blog
8) Diffusion Models in Vision: A Survey	TPAMI 23 Paper, GitHub
Diffusion Transformer
Paper	Link
1) UViT: All are Worth Words: A ViT Backbone for Diffusion Models	CVPR 23 Paper, GitHub, ModelScope
2) DiT: Scalable Diffusion Models with Transformers	ICCV 23 Paper, GitHub, ModelScope
3) SiT: Exploring Flow and Diffusion-based Generative Models with Scalable Interpolant Transformers	Paper, GitHub, ModelScope
4) FiT: Flexible Vision Transformer for Diffusion Model	Paper, GitHub
5) k-diffusion: Scalable High-Resolution Pixel-Space Image Synthesis with Hourglass Diffusion Transformers	Paper, GitHub
6) OpenDiT: An Easy, Fast and Memory-Efficient System for DiT Training and Inference	GitHub
7) Large-DiT: Large Diffusion Transformer	GitHub
8) VisionLLaMA: A Unified LLaMA Interface for Vision Tasks	Paper, GitHub
9) Stable Diffusion 3: MM-DiT: Scaling Rectified Flow Transformers for High-Resolution Image Synthesis	Paper, Blog
Baseline Video Generation Models
Paper	Link
1) ViViT: A Video Vision Transformer	ICCV 21 Paper, GitHub
2) VideoLDM: Align your Latents: High-Resolution Video Synthesis with Latent Diffusion Models	CVPR 23 Paper
3) LVDM: Latent Video Diffusion Models for High-Fidelity Long Video Generation	Paper, GitHub
4) Text2Video-Zero: Text-to-Image Diffusion Models are Zero-Shot Video Generators	Paper, GitHub
Video Generation
Paper	Link
1) Animatediff: Animate Your Personalized Text-to-Image Diffusion Models without Specific Tuning	ICLR 24 Paper, GitHub, ModelScope
2) I2VGen-XL: High-Quality Image-to-Video Synthesis via Cascaded Diffusion Models	Paper, GitHub, ModelScope
3) Imagen Video: High Definition Video Generation with Diffusion Models	Paper
4) MoCoGAN: Decomposing Motion and Content for Video Generation	CVPR 18 Paper
5) Adversarial Video Generation on Complex Datasets	Paper
6) W.A.L.T: Photorealistic Video Generation with Diffusion Models	Paper Project
7) VideoGPT: Video Generation using VQ-VAE and Transformers	Paper, GitHub
8) Video Diffusion Models	Paper, GitHub, Project
9) MCVD: Masked Conditional Video Diffusion for Prediction, Generation, and Interpolation	NeurIPS 22 Paper, GitHub, Project, Blog
10) VideoPoet: A Large Language Model for Zero-Shot Video Generation	Paper
11) MAGVIT: Masked Generative Video Transformer	CVPR 23 Paper, GitHub, Project, Colab
12) EMO: Emote Portrait Alive - Generating Expressive Portrait Videos with Audio2Video Diffusion Model under Weak Conditions	Paper, GitHub, Project
13) SimDA: Simple Diffusion Adapter for Efficient Video Generation	Paper, GitHub, Project
14) StableVideo: Text-driven Consistency-aware Diffusion Video Editing	ICCV 23 Paper, GitHub, Project
15) SVD: Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets	Paper, GitHub
16) ADD: Adversarial Diffusion Distillation	Paper, GitHub
17) GenTron: Diffusion Transformers for Image and Video Generation	CVPR 24 Paper, Project
18) LFDM: Conditional Image-to-Video Generation with Latent Flow Diffusion Models	CVPR 23 Paper, GitHub
19) MotionDirector: Motion Customization of Text-to-Video Diffusion Models	Paper, GitHub
20) TGAN-ODE: Latent Neural Differential Equations for Video Generation	Paper, GitHub
21) VideoCrafter1: Open Diffusion Models for High-Quality Video Generation	Paper, GitHub
22) VideoCrafter2: Overcoming Data Limitations for High-Quality Video Diffusion Models	Paper, GitHub
Patches Project
Paper	Link
1) Interactive Video Stylization Using Few-Shot Patch-Based Training	Paper, Github
2) Zoom-VQA: Patches, Frames and Clips Integration for Video Quality Assessment	Paper, Github
3) FlexiViT: One Model for All Patch Sizes	Paper, Github
4) MagViT2: Language Model Beats Diffusion -- Tokenizer is Key to Visual Generation	ICLR 24 Paper, Github
5) CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers	ICLR 23 Paper, Github
6) BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation	Paper, Github
7) BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models	Paper, Github
Long-context
Paper	Link
1) World Model on Million-Length Video And Language With RingAttention	Paper, GitHub
2) Ring Attention with Blockwise Transformers for Near-Infinite Context	Paper, GitHub
3) Extending LLMs' Context Window with 100 Samples	Paper, GitHub
4) Efficient Streaming Language Models with Attention Sinks	ICLR 24 Paper, GitHub
5) The What, Why, and How of Context Length Extension Techniques in Large Language Models – A Detailed Survey	Paper
6) MovieChat: From Dense Token to Sparse Memory for Long Video Understanding	CVPR 24 Paper, GitHub, Project
Audio Related Resource
Paper	Link
1) Stable Audio: Fast Timing-Conditioned Latent Audio Diffusion	Link
2) MM-Diffusion: Learning Multi-Modal Diffusion Models for Joint Audio and Video Generation	CVPR 23 Paper, GitHub
3) Pengi: An Audio Language Model for Audio Tasks	NeurIPS 23 Paper, GitHub
4) Vast: A vision-audio-subtitle-text omni-modality foundation model and dataset	NeurlPS 23 Paper, GitHub
5）NaturalSpeech: End-to-End Text to Speech Synthesis with Human-Level Quality	Paper, GitHub
6) NaturalSpeech 2: Latent Diffusion Models are Natural and Zero-Shot Speech and Singing Synthesizers	Paper, GitHub
7) UniAudio: An Audio Foundation Model Toward Universal Audio Generation	Paper, GitHub
Consistency
Paper	Link
1) Layered Neural Atlases for Consistent Video Editing	TOG 21 Paper, GitHub, Project,
2) StableVideo: Text-driven Consistency-aware Diffusion Video Editing	ICCV 23 Paper, GitHub, Project
3) CoDeF: Content Deformation Fields for Temporally Consistent Video Processing	Paper, GitHub, Project
4) Consistency Models	ICML 23 Paper, GitHub
Prompt Engineering
Paper	Link
1) RealCompo: Dynamic Equilibrium between Realism and Compositionality Improves Text-to-Image Diffusion Models	Paper, GitHub, Project
2) Mastering Text-to-Image Diffusion: Recaptioning, Planning, and Generating with Multimodal LLMs	Paper, GitHub
3) LLM-grounded Diffusion: Enhancing Prompt Understanding of Text-to-Image Diffusion Models with Large Language Models	TMLR 23 Paper, GitHub
4) LLM BLUEPRINT: ENABLING TEXT-TO-IMAGE GEN-ERATION WITH COMPLEX AND DETAILED PROMPTS	ICLR 24 Paper, GitHub
5) Progressive Text-to-Image Diffusion with Soft Latent Direction	Paper
6) Self-correcting LLM-controlled Diffusion Models	CVPR 24 Paper, GitHub
7) LayoutLLM-T2I: Eliciting Layout Guidance from LLM for Text-to-Image Generation	MM 23 Paper
8) LayoutGPT: Compositional Visual Planning and Generation with Large Language Models	NeurIPS 23 Paper, GitHub
9) Gen4Gen: Generative Data Pipeline for Generative Multi-Concept Composition	Paper, GitHub
10) InstructEdit: Improving Automatic Masks for Diffusion-based Image Editing With User Instructions	Paper, GitHub
11) Controllable Text-to-Image Generation with GPT-4	Paper
12) LLM-grounded Video Diffusion Models	ICLR 24 Paper
13) VideoDirectorGPT: Consistent Multi-scene Video Generation via LLM-Guided Planning	Paper
14) FlowZero: Zero-Shot Text-to-Video Synthesis with LLM-Driven Dynamic Scene Syntax	Paper
15) VideoDrafter: Content-Consistent Multi-Scene Video Generation with LLM	Paper
16) Free-Bloom: Zero-Shot Text-to-Video Generator with LLM Director and LDM Animator	NeurIPS 23 Paper
17) Empowering Dynamics-aware Text-to-Video Diffusion with Large Language Models	Paper
18) MotionZero: Exploiting Motion Priors for Zero-shot Text-to-Video Generation	Paper
19) GPT4Motion: Scripting Physical Motions in Text-to-Video Generation via Blender-Oriented GPT Planning	Paper
Security
Paper	Link
1) BeaverTails: Towards Improved Safety Alignment of LLM via a Human-Preference Dataset	NeurIPS 23 Paper, Github
2) LIMA: Less Is More for Alignment	NeurIPS 23 Paper
3) Jailbroken: How Does LLM Safety Training Fail?	NeurIPS 23 Paper
4) Safe Latent Diffusion: Mitigating Inappropriate Degeneration in Diffusion Models	CVPR 23 Paper
5) Stable Bias: Evaluating Societal Representations in Diffusion Models	NeurIPS 23 Paper
World Model
Paper	Link
1) NExT-GPT: Any-to-Any Multimodal LLM	Paper, GitHub

Dataset
Dataset Name - Paper	Link
1) Panda-70M - Panda-70M: Captioning 70M Videos with Multiple Cross-Modality Teachers `70M Clips, 720P, Downloadable`	CVPR 24 Paper, Github, Project
2) InternVid-10M - InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation `10M Clips, 720P, Downloadable`	ArXiv 24 Paper, Github
3) CelebV-Text - CelebV-Text: A Large-Scale Facial Text-Video Dataset `70K Clips, 720P, Downloadable`	CVPR 23 Paper, Github, Project
4) HD-VG-130M - VideoFactory: Swap Attention in Spatiotemporal Diffusions for Text-to-Video Generation `130M Clips, 720P, Downloadable`	ArXiv 23 Paper, Github, Tool
5) HD-VILA-100M - Advancing High-Resolution Video-Language Representation with Large-Scale Video Transcriptions `100M Clips, 720P, Downloadable`	CVPR 22 Paper, Github
6) VideoCC - Learning Audio-Video Modalities from Image Captions `10.3M Clips, 720P, Downloadable`	ECCV 22 Paper, Github
7) YT-Temporal-180M - MERLOT: Multimodal Neural Script Knowledge Models `180M Clips, 480P, Downloadable`	NeurIPS 21 Paper, Github, Project
8) HowTo100M - HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips `136M Clips, 240P, Downloadable`	ICCV 19 Paper, Github, Project
9) UCF101 - UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild `13K Clips, 240P, Downloadable`	CVPR 12 Paper, Project
10) MSVD - Collecting Highly Parallel Data for Paraphrase Evaluation `122K Clips, 240P, Downloadable`	ACL 11 Paper, Project
Existing high-quality resources
Resources	Link
1) Datawhale - AI视频生成学习	Feishu doc
2) A Survey on Generative Diffusion Model	TKDE 24 Paper, GitHub
3) Awesome-Video-Diffusion-Models: A Survey on Video Diffusion Models	Paper, GitHub
4) Awesome-Text-To-Video：A Survey on Text-to-Video Generation/Synthesis	GitHub
5) video-generation-survey: A reading list of video generation	GitHub
6) Awesome-Video-Diffusion	GitHub
7) Video Generation Task in Papers With Code	Task
8) Sora: A Review on Background, Technology, Limitations, and Opportunities of Large Vision Models	Paper, GitHub
9) Open-Sora-Plan (PKU-YuanGroup)	GitHub
10) State of the Art on Diffusion Models for Visual Computing	Paper
11) Diffusion Models: A Comprehensive Survey of Methods and Applications	CSUR 24 Paper, GitHub
12) Generate Impressive Videos with Text Instructions: A Review of OpenAI Sora, Stable Diffusion, Lumiere and Comparable	Paper
13) On the Design Fundamentals of Diffusion Models: A Survey	Paper
14) Efficient Diffusion Models for Vision: A Survey	Paper
15) Text-to-Image Diffusion Models in Generative AI: A Survey	Paper
16) Awesome-Diffusion-Transformers	GitHub, Project
17) Open-Sora (HPC-AI Tech)	GitHub, Blog
18) LAVIS - A Library for Language-Vision Intelligence	ACL 23 Paper, GitHub, Project

Mini Sora WeChat Community Exchange Group

Star History

How to Contribute to the Mini Sora Community

We greatly appreciate your contributions to the Mini Sora open-source community and helping us make it even better than it is now!

For more details, please refer to the Contribution Guidelines

AdamMayor2018/minisora

Mini Sora Community

Reproduction Group of MiniSora Community

Sora Reproduction Goals of MiniSora

MiniSora-DiT: Reproducing the DiT Paper with XTuner

Requirements

Background

Support

Hot News

Recent Roundtable Discussions

Paper Interpretation of Stable Diffusion 3 paper: MM-DiT

Highlights from Previous Discussions

Paper Reading Program

Recruitment of Presenters

Related Work

Diffusion Models

Diffusion Transformer

Baseline Video Generation Models

Video Generation

Patches Project

Long-context

Audio Related Resource

Consistency

Prompt Engineering

Security

World Model

Dataset

Existing high-quality resources

Mini Sora WeChat Community Exchange Group

Star History

How to Contribute to the Mini Sora Community

Community contributors