CogVideo && CogVideoX

中文阅读

日本語で読む

🤗 Experience on CogVideoX Huggingface Space

📚 Check here to view Paper

👋 Join our WeChat and Discord

📍 Visit 清影 and API Platform to experience larger-scale commercial video generation models.

Update and News

🔥🔥 News: 2024/8/15: The SwissArmyTransformer dependency in CogVideoX has been upgraded to 0.4.12. Fine-tuning no longer requires installing SwissArmyTransformer from source. Additionally, the Tied VAE technique has been applied in the implementation within the diffusers library. Please install diffusers and accelerate libraries from source. Inference for CogVideoX now requires only 12GB of VRAM.
🔥 News: 2024/8/12: The CogVideoX paper has been uploaded to arxiv. Feel free to check out the paper.
🔥 News: 2024/8/7: CogVideoX has been integrated into diffusers version 0.30.0. Inference can now be performed on a single 3090 GPU. For more details, please refer to the code.
🔥 News: 2024/8/6: We have also open-sourced 3D Causal VAE used in CogVideoX-2B, which can reconstruct the video almost losslessly.
🔥 News: 2024/8/6: We have open-sourced CogVideoX-2B，the first model in the CogVideoX series of video generation models.
🌱 Source: 2022/5/19: We have open-sourced CogVideo (now you can see in CogVideo branch)，the first open-sourced pretrained text-to-video model, and you can check ICLR'23 CogVideo Paper for technical details.

More powerful models with larger parameter sizes are on the way~ Stay tuned!

Jump to a specific section:

Quick Start
- SAT
- Diffusers
CogVideoX-2B Video Works
Introduction to the CogVideoX Model
Full Project Structure
- Inference
- SAT
- Tools
Introduction to CogVideo(ICLR'23) Model
Citations
Open Source Project Plan
Model License

Quick Start

Prompt Optimization

Before running the model, please refer to this guide to see how we use large models like GLM-4 (or other comparable products, such as GPT-4) to optimize the model. This is crucial because the model is trained with long prompts, and a good prompt directly impacts the quality of the video generation.

SAT

Please make sure your Python version is between 3.10 and 3.12, inclusive of both 3.10 and 3.12.

Follow instructions in sat_demo: Contains the inference code and fine-tuning code of SAT weights. It is recommended to improve based on the CogVideoX model structure. Innovative researchers use this code to better perform rapid stacking and development. (18 GB for inference, 40GB for lora finetune)

Diffusers

Please make sure your Python version is between 3.10 and 3.12, inclusive of both 3.10 and 3.12.

pip install -r requirements.txt

Then follow diffusers_demo: A more detailed explanation of the inference code, mentioning the significance of common parameters. (24GB for inference,fine-tuned code are under development)

CogVideoX-2B Gallery

1.mp4

A detailed wooden toy ship with intricately carved masts and sails is seen gliding smoothly over a plush, blue carpet that mimics the waves of the sea. The ship's hull is painted a rich brown, with tiny windows. The carpet, soft and textured, provides a perfect backdrop, resembling an oceanic expanse. Surrounding the ship are various other toys and children's items, hinting at a playful environment. The scene captures the innocence and imagination of childhood, with the toy ship's journey symbolizing endless adventures in a whimsical, indoor setting.

2.mp4

The camera follows behind a white vintage SUV with a black roof rack as it speeds up a steep dirt road surrounded by pine trees on a steep mountain slope, dust kicks up from its tires, the sunlight shines on the SUV as it speeds along the dirt road, casting a warm glow over the scene. The dirt road curves gently into the distance, with no other cars or vehicles in sight. The trees on either side of the road are redwoods, with patches of greenery scattered throughout. The car is seen from the rear following the curve with ease, making it seem as if it is on a rugged drive through the rugged terrain. The dirt road itself is surrounded by steep hills and mountains, with a clear blue sky above with wispy clouds.

3.mp4

A street artist, clad in a worn-out denim jacket and a colorful bandana, stands before a vast concrete wall in the heart, holding a can of spray paint, spray-painting a colorful bird on a mottled wall.

4.mp4

In the haunting backdrop of a war-torn city, where ruins and crumbled walls tell a story of devastation, a poignant close-up frames a young girl. Her face is smudged with ash, a silent testament to the chaos around her. Her eyes glistening with a mix of sorrow and resilience, capturing the raw emotion of a world that has lost its innocence to the ravages of conflict.

Model Introduction

CogVideoX is an open-source version of the video generation model, which is homologous to 清影.

The table below shows the list of video generation models we currently provide, along with related basic information:

Model Name	CogVideoX-2B
Prompt Language	English
Single GPU Inference (FP16)	18GB using SAT 23.9GB using diffusers
Multi GPUs Inference (FP16)	20GB minimum per GPU using diffusers
GPU Memory Required for Fine-tuning(bs=1)	40GB
Prompt Max Length	226 Tokens
Video Length	6 seconds
Frames Per Second	8 frames
Resolution	720 * 480
Quantized Inference	Not Supported
Download Link (HF diffusers Model)	🤗 Huggingface 🤖 ModelScope 💫 WiseModel
Download Link (SAT Model)	SAT

Friendly Links

We highly welcome contributions from the community and actively contribute to the open-source community. The following works have already been adapted for CogVideoX, and we invite everyone to use them:

Xorbits Inference: A powerful and comprehensive distributed inference framework, allowing you to easily deploy your own models or the latest cutting-edge open-source models with just one click.

Project Structure

This open-source repository will guide developers to quickly get started with the basic usage and fine-tuning examples of the CogVideoX open-source model.

Inference

diffusers_demo: A more detailed explanation of the inference code, mentioning the significance of common parameters.
diffusers_vae_demo: Executing the VAE inference code alone currently requires 71GB of memory, but it will be optimized in the future.
convert_demo: How to convert user input into a format suitable for CogVideoX. Because CogVideoX is trained on long caption, we need to convert the input text to be consistent with the training distribution using a LLM. By default, the script uses GLM4, but it can also be replaced with any other LLM such as GPT, Gemini, etc.
gradio_web_demo: A simple gradio web UI demonstrating how to use the CogVideoX-2B model to generate videos. Same as Our Huggingface Space, you can use this script to launch a web demo.

cd inference
# For Linux and Windows users (and macOS with Intel??)
python gradio_web_demo.py # humans mode

# For macOS with Apple Silicon users, Intel not supported, this maybe 20x slower than RTX 4090
PYTORCH_ENABLE_MPS_FALLBACK=1 python gradio_web_demo.py # humans mode

streamlit_web_demo: A simple streamlit web application demonstrating how to use the CogVideoX-2B model to generate videos.

sat

sat_demo: Contains the inference code and fine-tuning code of SAT weights. It is recommended to improve based on the CogVideoX model structure. Innovative researchers use this code to better perform rapid stacking and development.

Tools

This folder contains some tools for model conversion / caption generation, etc.

convert_weight_sat2hf: Convert SAT model weights to Huggingface model weights.
caption_demo: Caption tool, a model that understands videos and outputs them in text.

CogVideo(ICLR'23)

The official repo for the paper: CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers is on the CogVideo branch

CogVideo is able to generate relatively high-frame-rate videos. A 4-second clip of 32 frames is shown below.

cogvideo.mp4

The demo for CogVideo is at https://models.aminer.cn/cogvideo, where you can get hands-on practice on text-to-video generation. The original input is in Chinese.

Citation

🌟 If you find our work helpful, please leave us a star and cite our paper.

@article{yang2024cogvideox,
  title={CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer},
  author={Yang, Zhuoyi and Teng, Jiayan and Zheng, Wendi and Ding, Ming and Huang, Shiyu and Xu, Jiazheng and Yang, Yuanming and Hong, Wenyi and Zhang, Xiaohan and Feng, Guanyu and others},
  journal={arXiv preprint arXiv:2408.06072},
  year={2024}
}
@article{hong2022cogvideo,
  title={CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers},
  author={Hong, Wenyi and Ding, Ming and Zheng, Wendi and Liu, Xinghan and Tang, Jie},
  journal={arXiv preprint arXiv:2205.15868},
  year={2022}
}

Open Source Project Plan

Open source CogVideoX model
- Open source 3D Causal VAE used in CogVideoX.
- CogVideoX model inference example (CLI / Web Demo)
- CogVideoX online experience demo (Huggingface Space)
- CogVideoX open source model API interface example (Huggingface)
- CogVideoX model fine-tuning example (SAT)
- CogVideoX model fine-tuning example (Huggingface / SAT)
- Open source CogVideoX-Pro (adapted for CogVideoX-2B suite)
- Release CogVideoX technical report

We welcome your contributions. You can click here for more information.

Model License

The code in this repository is released under the Apache 2.0 License.

The model weights and implementation code are released under the CogVideoX LICENSE.

avat2/CogVideo