An initiative to replicate Sora

Primary language: Python. License: Apache-2.0.

Lite-Sora

Introduction

The lite-sora project is an initiative to replicate Sora, co-launched by East China Normal University and the ModelScope community. It aims to explore the minimal reproduction and streamlined implementation of the video generation algorithms behind Sora. We hope to provide concise and readable code to facilitate collective experimentation and improvement, continuously pushing the boundaries of open-source video generation technology.

Roadmap

  • Implement the base architecture
  • Validate on small datasets
  • Train Video Encoder & Decoder on large datasets
  • Train VideoDiT on large datasets

Usage

Python Environment

conda env create -f environment.yml
conda activate litesora
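
After activating the environment, a quick import check can confirm that the packages used by the scripts in this README are visible to Python. This is a minimal sketch, not part of the project: the helper name `missing_packages` is hypothetical, and the package list is taken only from the import statements shown in this README (it assumes you run it from the repository root so that `litesora` can be found).

```python
from importlib.util import find_spec

# Packages imported by the scripts in this README (assumed list, not exhaustive).
REQUIRED = ["torch", "lightning", "tqdm", "litesora"]


def missing_packages(names):
    """Return the names that cannot be found on the current import path."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages(REQUIRED)
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All required packages are importable.")
```

If anything is reported missing, re-creating the conda environment from environment.yml is the first thing to try.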

Download Models

  • models/text_encoder/model.safetensors: Stable Diffusion XL's Text Encoder.
  • models/denoising_model/model.safetensors: A denoising model that we trained on a small dataset, Pixabay100, at a resolution of 64*64. It demonstrates that our training code can fit the training data. Because the training set is so small, the model is overfitted and does not generalize at this stage; its sole purpose is to verify the correctness of the training algorithm.
  • models/vae/model.safetensors: Stable Video Diffusion's VAE.
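
Since the scripts below load these files by relative path, it may help to verify the layout before training or inference. The following is a minimal sketch under the assumption that the three files sit at exactly the paths listed above, relative to the repository root; the helper name `missing_model_files` is hypothetical.

```python
from pathlib import Path

# The three paths listed in the "Download Models" section.
EXPECTED_FILES = [
    "models/text_encoder/model.safetensors",
    "models/denoising_model/model.safetensors",
    "models/vae/model.safetensors",
]


def missing_model_files(root="."):
    """Return the expected model paths that do not exist under `root`."""
    root = Path(root)
    return [path for path in EXPECTED_FILES if not (root / path).is_file()]


if __name__ == "__main__":
    missing = missing_model_files()
    if missing:
        print("Missing model files:")
        for path in missing:
            print("  " + path)
    else:
        print("All model files are in place.")
```

The check only looks at file presence; it does not validate the contents of the weights.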

Training

from litesora.data import TextVideoDataset
from litesora.models import SDXLTextEncoder2
from litesora.trainers.v1 import LightningVideoDiT
import lightning as pl
import torch


if __name__ == '__main__':
    # dataset and data loader
    dataset = TextVideoDataset("data/pixabay100", "data/pixabay100/metadata.json",
                               num_frames=64, height=64, width=64)
    train_loader = torch.utils.data.DataLoader(dataset, shuffle=True, batch_size=1, num_workers=8)

    # model
    model = LightningVideoDiT(learning_rate=1e-5)
    model.text_encoder.load_state_dict_from_diffusers("models/text_encoder/model.safetensors")

    # train
    trainer = pl.Trainer(max_epochs=100000, accelerator="gpu", devices="auto", callbacks=[
        pl.pytorch.callbacks.ModelCheckpoint(save_top_k=-1)
    ])
    trainer.fit(model=model, train_dataloaders=train_loader)

While the training program is running, you can launch TensorBoard to monitor the training loss.

tensorboard --logdir .

Inference

  • Synthesize a video in the pixel space.

from litesora.models import SDXLTextEncoder2, VideoDiT
from litesora.pipelines import PixelVideoDiTPipeline
from litesora.data import save_video
import torch


# models
text_encoder = SDXLTextEncoder2.from_diffusers("models/text_encoder/model.safetensors")
denoising_model = VideoDiT.from_pretrained("models/denoising_model/model.safetensors")

# pipeline
pipe = PixelVideoDiTPipeline(torch_dtype=torch.float16, device="cuda")
pipe.fetch_models(text_encoder, denoising_model)

# generate a video
prompt = "woman, flowers, plants, field, garden"
video = pipe(prompt=prompt, num_inference_steps=100)

# save the video (the resolution is 64*64, we enlarge it to 512*512 here)
save_video(video, "output.mp4", upscale=8)
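
To make the effect of `upscale=8` concrete: a 64*64 frame enlarged by a factor of 8 becomes 512*512. The sketch below illustrates this with nearest-neighbor repetition in NumPy. It is only an illustration of the arithmetic; how `save_video` actually resizes frames is not shown in this README, and the helper `upscale_nearest` is hypothetical.

```python
import numpy as np


def upscale_nearest(frame, factor):
    """Repeat every pixel `factor` times along height (axis 0) and width (axis 1)."""
    return np.repeat(np.repeat(frame, factor, axis=0), factor, axis=1)


# A dummy 64*64 RGB frame in (height, width, channels) layout (assumed layout).
frame = np.zeros((64, 64, 3), dtype=np.uint8)
print(upscale_nearest(frame, 8).shape)  # (512, 512, 3)
```
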
  • Encode a video into the latent space, and then decode it.

from litesora.models import SDVAEEncoder, SVDVAEDecoder
from litesora.data import load_video, tensor2video, concat_video, save_video
import torch
from tqdm import tqdm


frames = load_video("data/pixabay100/videos/168572 (Original).mp4",
                    num_frames=1024, height=1024, width=1024, random_crop=False)
frames = frames.to(dtype=torch.float16, device="cpu")

encoder = SDVAEEncoder.from_diffusers("models/vae/model.safetensors").to(dtype=torch.float16, device="cuda")
decoder = SVDVAEDecoder.from_diffusers("models/vae/model.safetensors").to(dtype=torch.float16, device="cuda")

with torch.no_grad():
    print(frames.shape)
    latents = encoder.encode_video(frames, progress_bar=tqdm)
    print(latents.shape)
    decoded_frames = decoder.decode_video(latents, progress_bar=tqdm)

video = tensor2video(concat_video([frames, decoded_frames]))
save_video(video, "video.mp4", fps=24)
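
As a rough guide to what the two `print` calls above might report, the sketch below estimates the latent size and shows how a long clip can be split into chunks so that only a few frames need to be on the GPU at once. Everything here is an assumption for illustration: it uses the 8x spatial downsampling and 4 latent channels typical of the Stable Diffusion family of VAEs, a (frames, channels, height, width) layout, and a hypothetical chunk size. It does not describe how `encode_video` is actually implemented.

```python
def latent_shape(num_frames, height, width, spatial_factor=8, latent_channels=4):
    """Estimate the latent shape for a clip, assuming per-frame spatial compression only."""
    return (num_frames, latent_channels, height // spatial_factor, width // spatial_factor)


def iter_chunks(num_frames, chunk_size):
    """Yield (start, end) frame index pairs that cover the clip in order."""
    for start in range(0, num_frames, chunk_size):
        yield start, min(start + chunk_size, num_frames)


# The clip from the example above: 1024 frames at 1024*1024.
print(latent_shape(1024, 1024, 1024))  # (1024, 4, 128, 128)

# Hypothetical chunk size of 4 frames on a 10-frame clip.
print(list(iter_chunks(10, 4)))  # [(0, 4), (4, 8), (8, 10)]
```

Keeping the full frame tensor on the CPU, as the example does with `device="cpu"`, and moving only one chunk at a time to the GPU is one common way to keep memory use bounded for long videos.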

Results (Experimental)

The prompts below correspond to sample videos generated by the denoising model trained on Pixabay100 (see Download Models) at a resolution of 64*64. The model is overfitted to this small dataset and does not generalize at this stage, so these results serve only to verify the correctness of the training algorithm.

  • airport, people, crowd, busy
  • beach, ocean, waves, water, sand
  • bee, honey, insect, beehive, nature
  • coffee, beans, caffeine, coffee, shop
  • fish, underwater, aquarium, swim
  • forest, woods, mystical, morning
  • ocean, beach, sunset, sea, atmosphere
  • hair, wind, girl, woman, people
  • reeds, grass, wind, golden, sunshine
  • sea, ocean, seagulls, birds, sunset
  • woman, flowers, plants, field, garden
  • wood, anemones, wildflower, flower

We use the VAE from Stable Video Diffusion to encode videos into the latent space. Our code supports very long, high-resolution videos; the encode/decode example under Inference loads a clip with num_frames=1024 at 1024*1024.

video_vae_compressed.mp4