
Latte Text to Video Training

Among open-source video generation models, Latte is by far the closest to Sora.

The original Latte repo doesn't provide text-to-video training code. We reproduced the paper and implemented text-to-video training on top of it.

Please see the paper for more details:

Latte: Latent Diffusion Transformer for Video Generation

(Figure: the architecture of Latte)

Improvements

The following improvements are implemented to the training code:

  • added support for gradient accumulation (config: gradient_accumulation_steps)
  • added validation sample generation (config: validation) to produce test videos during training
  • added wandb support
  • added classifier-free guidance training (config: cfg_random_null_text_ratio), sketched below
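For reference, classifier-free guidance training works by randomly replacing a caption with the empty string, so the model also learns the unconditional distribution that guidance interpolates against at sampling time. A minimal sketch of the idea in Python (cfg_random_null_text_ratio comes from the config above; the function name and prompt-list interface are illustrative assumptions, not the repo's actual code):

import random

def maybe_drop_text(prompts, cfg_random_null_text_ratio):
    # With probability cfg_random_null_text_ratio, replace each prompt with
    # the empty string so the model learns an unconditional branch.
    # At sampling time this enables classifier-free guidance:
    #   eps = eps_uncond + guidance_scale * (eps_cond - eps_uncond)
    return ["" if random.random() < cfg_random_null_text_ratio else p
            for p in prompts]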

Step 1: set up the environment

First, download and set up the repo:

git clone https://github.com/lyogavin/Latte_t2v_training.git
conda env create -f environment.yml
conda activate latte
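Optionally, confirm that PyTorch can see your GPU before moving on (a generic sanity check, not specific to this repo):

python -c "import torch; print(torch.__version__, torch.cuda.is_available())"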

If you find it too complicated to set up the environment and resolve all the package versions, CUDA drivers, etc., you can try our vast.ai template here.

Step 2: download pretrained model

You can download the pretrained model as follows:

sudo apt-get install git-lfs # or: sudo yum install git-lfs
git lfs install

git clone --depth=1 --no-single-branch https://huggingface.co/maxin-cn/Latte /root/pretrained_Latte/
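If you'd rather not deal with git-lfs, the same checkpoint can be fetched with the huggingface_hub Python package (an alternative path we're suggesting here, not the repo's documented one):

from huggingface_hub import snapshot_download

# Download the full maxin-cn/Latte snapshot into the same directory
# the git clone command above uses.
snapshot_download(repo_id="maxin-cn/Latte", local_dir="/root/pretrained_Latte/")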

Step 3: prepare training data

Put the video files in a directory and create a CSV file that specifies the prompt for each video.

The CSV file has two columns, video_file_name and prompt:

video_file_name,prompt
VIDEO_FILE_001.mp4,PROMPT_001
VIDEO_FILE_002.mp4,PROMPT_002
...
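If your videos all sit in one folder, a small script can draft the CSV for you. This is a sketch under two assumptions: the columns are comma-separated and named as above, and you will replace the placeholder prompts with real captions afterwards:

import csv
from pathlib import Path

video_folder = Path("videos")  # hypothetical path; point at your data

with open("train_data.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["video_file_name", "prompt"])
    for video in sorted(video_folder.glob("*.mp4")):
        # Placeholder prompt; replace with a real caption for each clip.
        writer.writerow([video.name, f"PROMPT_FOR_{video.stem}"])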

Step 4: config

The config file is configs/t2v/t2v_img_train.yaml and it's pretty self-explanatory.

A few config entries to note:

  • point video_folder and csv_path to your training data
  • point pretrained_model_path to the t2v_required_models directory of the downloaded model
  • point pretrained to the t2v.pt file in the downloaded model
  • optionally change text_prompt under the validation section to your own test prompts; every ckpt_every steps during training, videos are generated from these prompts and published to wandb for you to check (see the excerpt after this list)
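Putting those entries together, the relevant part of configs/t2v/t2v_img_train.yaml might look like the excerpt below. The field names are taken from the bullets above; the values and the exact nesting are illustrative assumptions, so keep the rest of the shipped file as-is:

# excerpt with illustrative values only
video_folder: /data/videos
csv_path: /data/train_data.csv
pretrained_model_path: /root/pretrained_Latte/t2v_required_models
pretrained: /root/pretrained_Latte/t2v.pt
gradient_accumulation_steps: 4
cfg_random_null_text_ratio: 0.1
ckpt_every: 1000
validation:
  text_prompt:
    - "a cat playing with a ball of yarn"
    - "waves crashing on a rocky beach"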

Step 5: train!

./run_img_t2v_train.sh
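If you enabled the wandb logging mentioned above, log in once before launching the run (standard wandb CLI usage):

pip install wandb
wandb login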

Cloud GPUs

We recommend vast.ai GPUs for training.

We find it good value: low prices, good network speed, and a wide range of GPUs to choose from, all professionally optimized for AI training.

Feel free to use our template here, where the environment is already set up.

Inference

Refer to the original repo for inference instructions.

Stay Connected with Us

  • WeChat public account
  • WeChat group
  • Discord
  • Tech Blog
  • Website
  • Little RedBook

Contribution

Buy me a coffee please! 🙏

"Buy Me A Coffee"

By: Anima AI
