FlashI2V

An official implementation of FlashI2V.

License: Apache-2.0

If you like our project, please give us a star ⭐ on GitHub for the latest updates.

arXiv | Hugging Face Page

💡 We also have other generation projects that may interest you ✨.

Open-Sora Plan: Open-Source Large Video Generation Model
Bin Lin, Yunyang Ge, Xinhua Cheng, et al.
github arXiv

UniWorld-V1: High-Resolution Semantic Encoders for Unified Visual Understanding and Generation
Bin Lin, Zongjian Li, Xinhua Cheng, et al.
github arXiv

Identity-Preserving Text-to-Video Generation by Frequency Decomposition
Shenghai Yuan, Jinfa Huang, Xianyi He, et al.
github arXiv

📣 News

  • [2025.09.30] We have uploaded the Ascend version of the training and inference code, along with the model weights. For details, please refer to the NPU branch.

🗓️ TODO

😍 Gallery

Image-to-Video Results of FlashI2V-1.3B

video_000007.mp4
video_000021.mp4
video_000027.mp4
video_000035.mp4
video_000109.mp4
video_000142.mp4
video_000149.mp4
video_000151.mp4
video_000163.mp4
video_000191.mp4
video_000273.mp4
video_000280.mp4
video_000172.mp4
video_000184.mp4
video_000214.mp4
video_000352.mp4

😮 Highlights

Overfitting to In-domain Data Causes Performance Degradation

  • Existing I2V methods suffer from conditional image leakage. (a) Conditional image leakage causes performance degradation; the videos are sampled from Wan2.1-I2V-14B-480P with Vbench-I2V text-image pairs. (b) In the existing I2V paradigm, we observe that chunk-wise FVD on in-domain data increases over time, while chunk-wise FVD on out-of-domain data remains consistently high, indicating that the generation law learned on in-domain data by the existing paradigm fails to generalize to out-of-domain data. (A sketch of the chunk-wise FVD protocol follows below.)
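
The chunk-wise FVD protocol above can be read as: split real and generated videos into fixed-length temporal chunks and score each chunk separately, so that quality drift over time becomes visible. Below is a minimal sketch of that idea in PyTorch; `compute_fvd` is a hypothetical scorer (e.g., wrapping an I3D-based FVD implementation) and the tensor shapes are assumptions, not the evaluation code used in the paper.

```python
import torch

def chunkwise_fvd(real_videos: torch.Tensor,
                  fake_videos: torch.Tensor,
                  chunk_size: int,
                  compute_fvd) -> list:
    """Score real vs. generated videos chunk by chunk along the time axis.

    real_videos / fake_videos: (N, T, C, H, W) tensors with aligned lengths.
    compute_fvd: hypothetical callable returning a scalar FVD for two batches.
    A curve that rises with the chunk index indicates degradation over time,
    e.g., from conditional image leakage.
    """
    num_frames = min(real_videos.shape[1], fake_videos.shape[1])
    scores = []
    for start in range(0, num_frames - chunk_size + 1, chunk_size):
        real_chunk = real_videos[:, start:start + chunk_size]
        fake_chunk = fake_videos[:, start:start + chunk_size]
        scores.append(float(compute_fvd(real_chunk, fake_chunk)))
    return scores
```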

Model Overview

  • We propose FlashI2V to introduce conditions implicitly. We extract features from the conditional image latents with a learnable projection, then apply latent shifting to obtain a renewed intermediate state that implicitly contains the condition. Simultaneously, the conditional image latents undergo a Fourier transform to extract high-frequency magnitude features as guidance, which are concatenated with the noisy latents and injected into the DiT. During inference, we start from the shifted noise and progressively denoise along the ODE, finally decoding the video. (A minimal sketch of these two ingredients follows below.)
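
As a reading aid, here is a minimal PyTorch sketch of the two ingredients described above: a learnable projection that shifts the initial noise with the conditional image latent, and a Fourier-domain high-pass magnitude used as guidance. All names (`LatentShifter`, `high_freq_guidance`, the 1x1 projection, the `cutoff` radius, the toy latent shape) are illustrative assumptions, not the released implementation.

```python
import torch
import torch.nn as nn

class LatentShifter(nn.Module):
    """Learnable projection used to shift the noise with the condition."""
    def __init__(self, latent_channels: int):
        super().__init__()
        # 1x1x1 projection over the conditional image latent (assumption).
        self.proj = nn.Conv3d(latent_channels, latent_channels, kernel_size=1)

    def shift(self, noise: torch.Tensor, cond_latent: torch.Tensor) -> torch.Tensor:
        # Add the projected condition to the noise so the condition enters
        # the initial state implicitly rather than as an explicit input frame.
        return noise + self.proj(cond_latent)

def high_freq_guidance(cond_latent: torch.Tensor, cutoff: float = 0.25) -> torch.Tensor:
    """Fourier-transform the conditional image latent, suppress low
    frequencies, and return the magnitude as a guidance feature."""
    freq = torch.fft.fftshift(torch.fft.fft2(cond_latent), dim=(-2, -1))
    h, w = cond_latent.shape[-2:]
    yy, xx = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    radius = ((yy - h // 2) ** 2 + (xx - w // 2) ** 2).float().sqrt()
    mask = (radius > cutoff * min(h, w)).to(freq.device)   # keep high frequencies
    return (freq * mask).abs()

if __name__ == "__main__":
    n, c, t, h, w = 1, 16, 13, 60, 104          # toy latent shape (assumption)
    noise = torch.randn(n, c, t, h, w)
    cond = torch.randn(n, c, 1, h, w)           # single-frame image latent
    shifter = LatentShifter(latent_channels=c)
    x_start = shifter.shift(noise, cond)        # implicit condition via shifting
    guidance = high_freq_guidance(cond)         # high-frequency magnitude
    # Concatenate guidance with the (shifted) noisy latents along channels
    # before feeding the DiT; at inference, start from x_start and integrate
    # the ODE back to a clean video latent.
    dit_input = torch.cat([x_start, guidance.expand_as(x_start)], dim=1)
```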

Best Generalization and Performance across Different I2V Paradigms

  • Comparing the chunk-wise FVD curves of different I2V paradigms on the training and validation sets, only FlashI2V exhibits the same time-increasing FVD pattern on both, suggesting that only FlashI2V transfers the generation law learned from in-domain data to out-of-domain data. In addition, FlashI2V achieves the lowest out-of-domain FVD, demonstrating its performance advantage.

Vbench Results

| Model | I2V Paradigm | Subject Consistency↑ | Background Consistency↑ | Motion Smoothness↑ | Dynamic Degree↑ | Aesthetic Quality↑ | Imaging Quality↑ | I2V Subject Consistency↑ | I2V Background Consistency↑ |
|---|---|---|---|---|---|---|---|---|---|
| SVD-XT-1.0 (1.5B) | Repeating Concat and Adding Noise | 95.52 | 96.61 | 98.09 | 52.36 | 60.15 | 69.80 | 97.52 | 97.63 |
| SVD-XT-1.1 (1.5B) | Repeating Concat and Adding Noise | 95.42 | 96.77 | 98.12 | 43.17 | 60.23 | 70.23 | 97.51 | 97.62 |
| SEINE-512x512 (1.8B) | Inpainting | 95.28 | 97.12 | 97.12 | 27.07 | 64.55 | 71.39 | 97.15 | 96.94 |
| CogVideoX-5B-I2V | Zero-padding Concat and Adding Noise | 94.34 | 96.42 | 98.40 | 33.17 | 61.87 | 70.01 | 97.19 | 96.74 |
| Wan2.1-I2V-14B-720P | Inpainting | 94.86 | 97.07 | 97.90 | 51.38 | 64.75 | 70.44 | 96.95 | 96.44 |
| CogVideoX1.5-5B-I2V† | Zero-padding Concat and Adding Noise | 95.04 | 96.52 | 98.47 | 37.48 | 62.68 | 70.99 | 97.78 | 98.73 |
| Wan2.1-I2V-14B-480P† | Inpainting | 95.68 | 97.44 | 98.46 | 45.20 | 61.44 | 70.37 | 97.83 | 99.08 |
| FlashI2V† (1.3B) | FlashI2V | 95.13 | 96.36 | 98.35 | 53.01 | 62.34 | 69.41 | 97.67 | 98.72 |

† denotes testing with recaptioned text-image pairs in Vbench-I2V.

🔒 License

🤝 Contributors

🙏 Acknowledgements

✏️ Citation

If you want to cite our work, please use the following BibTeX entry:

@misc{ge2025flashi2v,
      title={FlashI2V: Fourier-Guided Latent Shifting Prevents Conditional Image Leakage in Image-to-Video Generation}, 
      author={Yunyang Ge and Xinhua Cheng and Chengshu Zhao and Xianyi He and Shenghai Yuan and Bin Lin and Bin Zhu and Li Yuan},
      year={2025},
      eprint={2509.25187},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2509.25187}, 
}