This series began with a perplexing body-flipped video ... see more in this article.
*Demo: "a person swimming in ocean, high quality, 4K resolution."*
- Added support for Image-to-Video (I2V) generation in FIFO-Diffusion.
- Improved visual consistency in long video generation by:
  - Seeding the initial latent frame with the image embedding.
  - Using Weighted Q-caches in Spatio-Temporal Attention.
  - Extending the latent uniformly before the diagonal denoising (see the sketch below).
- More background on the 3D U-Net and Spatio-Temporal Attention is in my blog.
Check my article for more details.
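To make the uniform latent extension concrete, here is a minimal sketch of the idea, not the repo's actual code: instead of filling the FIFO queue with independent random noise, the first clean latent is repeated across the queue and noised to each position's diagonal noise level. The function name `init_queue_uniform` and the diffusers-style `scheduler.add_noise` call are assumptions for illustration.

```python
import torch

def init_queue_uniform(z0, timesteps, scheduler):
    """Hypothetical sketch: extend one clean latent z0 uniformly across
    the FIFO queue, noising each slot to the noise level of its diagonal
    position instead of starting every slot from pure random noise."""
    slots = []
    for t in timesteps:  # one diffusion timestep per queue position
        noise = torch.randn_like(z0)
        # diffusers-style forward noising q(z_t | z_0)
        slots.append(scheduler.add_noise(z0, noise, t.unsqueeze(0)))
    return torch.stack(slots, dim=0)  # [queue_len, 1, C, H, W]
```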
| FIFO-Diffusion | FIFO+ Uniform Latents |
|---|---|
| ![]() | ![]() |
| "a bicycle accelerating to gain speed, high quality, 4K resolution." | |
| ![]() | ![]() |
| "a car stuck in traffic during rush hour, high quality, 4K resolution." | |
```bash
conda create --name fifoplus python=3.10.14
conda activate fifoplus
pip install -r requirements.txt
```
| Model | Resolution | Checkpoint | Config |
|---|---|---|---|
| VideoCrafter2 (Text2Video) | 320x512 | Hugging Face | Link |
| VideoCrafter1 (Image2Video) | 320x512 | Hugging Face | Link |
Directory structure:

```
. FIFO-Diffusion_public
├── configs
│   ├── inference_i2v_512_v1.0.yaml
│   └── inference_t2v_512_v2.0.yaml
├── videocrafter_models
│   ├── base_512_v2
│   │   └── model.ckpt
│   └── Image2Video_512
│       └── model.ckpt
```
For `t2v` and `t2v_seed`, the prompt txt file should look like:

```
{prompt1}
{prompt2}
...
```
Example:

```
a person swimming in ocean, high quality, 4K resolution.
a person giving a presentation to a room full of colleagues, high quality, 4K resolution.
```
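A prompt file can also be written programmatically; a tiny sketch (the path `prompts/my_prompts.txt` is just an example):

```python
# Write one prompt per line, matching the t2v format above.
prompts = [
    "a person swimming in ocean, high quality, 4K resolution.",
    "a person giving a presentation to a room full of colleagues, high quality, 4K resolution.",
]
with open("prompts/my_prompts.txt", "w") as f:
    f.write("\n".join(prompts) + "\n")
```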
For `i2v`, each line should contain the image path and the prompt, separated by a semicolon:

```
{image_path_1};{prompt1}
{image_path_2};{prompt2}
...
```
Example:

```
/data/vbench2/a large wave crashes over a rocky cliff.jpg;a large wave crashes over a rocky cliff, high quality, 4K resolution.
/data/vbench2/A teddy bear is climbing over a wooden fence.jpg;A teddy bear is climbing over a wooden fence, high quality, 4K resolution.
```
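A quick way to sanity-check an i2v prompt file is to split each line on the first semicolon only, since image paths may contain spaces. `check_i2v_file` below is a hypothetical helper, not part of the repo:

```python
import os

def check_i2v_file(path):
    """Hypothetical helper: validate {image_path};{prompt} lines."""
    with open(path) as f:
        for lineno, line in enumerate(f, 1):
            if not line.strip():
                continue  # skip blank lines
            image_path, _, prompt = line.rstrip("\n").partition(";")
            assert prompt, f"line {lineno}: missing ';' separator"
            assert os.path.exists(image_path), f"line {lineno}: {image_path} not found"
```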
- Main options:
  - `i2v`: Image-to-Video generation
  - `t2v`: Text-to-Video generation
  - `t2v_seed`: seeds the initial latent frame with the image embedding
- Sub options:
  - `TTqcache_attn1`: enables Weighted Q-caches in Spatio-Temporal Attention
  - `unilatent`: extends the latent uniformly before the diagonal denoising
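The `--mode` string concatenates a main option with optional sub options, e.g. `t2v_TTqcache_attn1_unilatent`. A hypothetical reading of such a string (the real parsing in `videocrafter_main.py` may differ):

```python
def parse_mode(mode: str) -> dict:
    """Hypothetical sketch of how a --mode string maps to feature flags."""
    return {
        "i2v": mode.startswith("i2v"),
        "seed_first_latent": "seed" in mode,      # t2v_seed
        "weighted_q_cache": "TTqcache" in mode,   # TTqcache_attn1
        "uniform_latent": "unilatent" in mode,    # unilatent
    }

print(parse_mode("t2v_TTqcache_attn1_unilatent"))
# {'i2v': False, 'seed_first_latent': False, 'weighted_q_cache': True, 'uniform_latent': True}
```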
This will create a folder named `{experiment}` under the main directory, along with a `{experiment}.gif` (or `.mp4`):
```
. FIFO-Diffusion_public
├── results
│   └── videocraft_v2_fifo
│       ├── latents        # stores the clean latents from the base model
│       └── random_noise
│           └── {prompt}
│               └── {experiment}
```
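If you pass `--save_frames`, you can also re-assemble the saved frames yourself. A minimal sketch with imageio, assuming the frames are PNGs under the experiment folder (`{prompt}` and `{experiment}` are placeholders; the exact layout may differ):

```python
import glob
import imageio.v2 as imageio

# Collect saved frames in order and stitch them into an animated GIF.
frame_paths = sorted(glob.glob(
    "results/videocraft_v2_fifo/random_noise/{prompt}/{experiment}/*.png"))
frames = [imageio.imread(p) for p in frame_paths]
imageio.mimsave("fifo_output.gif", frames)
```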
```bash
python3 videocrafter_main.py \
    --config configs/inference_t2v_512_v2.0.yaml \
    --ckpt_path videocrafter_models/base_512_v2/model.ckpt \
    --prompt_file prompts/vbench_t2v_subject_consistency_debug.csv \
    --mode t2v_TTqcache_attn1_unilatent \
    --save_frames \
    --experiment t2v_TTqcache_attn1_unilatent
```
```bash
python3 videocrafter_main.py \
    --config configs/inference_i2v_512_v1.0.yaml \
    --ckpt_path videocrafter_models/Image2Video_512/model.ckpt \
    --prompt_file prompts/vbench_t2v_cohe_fromi2v.csv \
    --mode t2v_seed_TTqcache \
    --save_frames \
    --experiment t2v_seed_TTqcache
```
This repo is a fork of FIFO-Diffusion and uses VideoCrafter as the base model. The ideas are also inspired by ConsiStory: Training-Free Consistent Text-to-Image Generation and Cross-Image Attention for Zero-Shot Appearance Transfer; be sure to check out and cite the original publications. I am open to any discussion of this work!