Eagle 2.5 is a family of frontier vision-language models (VLMs) designed for long-context multimodal learning. While most existing VLMs focus on short-context tasks, Eagle 2.5 addresses the challenges of long video comprehension and high-resolution image understanding, providing a generalist framework for both. Eagle 2.5 supports up to 512 video frames and is trained jointly on image + video data.
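As a rough illustration of what the 512-frame budget implies for preprocessing, below is a minimal sketch that uniformly subsamples a long video before the frames are handed to the model's processor. It assumes OpenCV is available; the frame cap default and the helper name are illustrative choices, not part of the release.

```python
import cv2  # OpenCV, used here only for video decoding


def sample_frames(video_path: str, max_frames: int = 512):
    """Uniformly sample up to `max_frames` RGB frames from a video (illustrative sketch)."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    # Evenly spaced frame indices; short videos simply keep every frame.
    step = max(total / float(max_frames), 1.0)
    keep = {int(i * step) for i in range(min(total, max_frames))}
    frames, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx in keep:
            frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        idx += 1
    cap.release()
    return frames  # list of RGB arrays for the model's image/video processor
```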
We also introduce Eagle-Video-110K, a novel dataset with both story-level and clip-level annotations, curated specifically for long-video understanding. It contains over 110K annotated samples spanning QA, localization, and summarization, with videos ranging from a few minutes to 3 hours, pushing the limits of long-form visual reasoning.
Strong Results Across The Board:
SOTA on 6 out of 10 long video benchmarks
Outperforms GPT-4o (0806) on 3/5 video tasks
Outperforms Gemini 1.5 Pro on 4/6 video tasks
Matches or outperforms Qwen2.5-VL-72B on multiple key datasets
72.4% on Video-MME with 512 input frames
Strong image understanding with consistent improvement over Eagle 2, matching Qwen2.5-VL.
๐ฏ Key Innovations
Information-First Sampling:
Image Area Preservation (IAP): Optimizes image tiling to retain most of the original image area and aspect ratio, preserving fine-grained details.
Automatic Degrade Sampling (ADS): Dynamically balances visual and textual input, guaranteeing complete text retention while maximizing visual content within context-length constraints (illustrative sketches of IAP and ADS follow this list).
Progressive Mixed Post-Training:
Gradually increases the context length during training, enhancing the model's ability to process varying input sizes and improving information density over static sampling (see the schedule sketch after this list).
Diversity-Driven Data Recipe:
Combines open-source data (human-annotated and synthetic) with the self-curated Eagle-Video-110K dataset, collected via a diversity-driven strategy and annotated with both story-level and clip-level QA pairs.
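To make the IAP idea concrete, here is a minimal sketch of area- and aspect-ratio-aware tile-grid selection. The tile size, the tile cap, and the equal weighting of the two penalties are assumptions made for illustration; the exact procedure in the tech report may differ.

```python
def select_tile_grid(width: int, height: int, tile: int = 448, max_tiles: int = 12):
    """IAP-style grid selection sketch: favor grids that keep the original
    aspect ratio and lose as little image area as possible when resizing."""
    src_aspect = width / height
    src_area = width * height
    best, best_score = (1, 1), float("inf")
    for rows in range(1, max_tiles + 1):
        for cols in range(1, max_tiles // rows + 1):
            grid_w, grid_h = cols * tile, rows * tile
            # Penalty 1: deviation from the original aspect ratio.
            aspect_err = abs((grid_w / grid_h) - src_aspect) / src_aspect
            # Penalty 2: fraction of the original area that cannot be kept.
            area_kept = min(grid_w * grid_h, src_area) / src_area
            # Equal weighting of the two penalties is an arbitrary illustrative choice.
            score = aspect_err + (1.0 - area_kept)
            if score < best_score:
                best, best_score = (rows, cols), score
    return best  # (rows, cols) of the tiling grid
```

The difference from naive tiling is that the score explicitly charges for any original image area a candidate grid would force the resize step to discard.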
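And a rough sketch of how ADS-style budgeting interacts with a progressively growing context window. The per-frame token cost, the reserved overhead, and the stage lengths are placeholder values, not the ones used in training.

```python
def max_visual_frames(text_tokens: int, context_len: int,
                      tokens_per_frame: int = 256, reserved: int = 1024) -> int:
    """ADS-style budgeting sketch: keep every text token, then spend whatever
    context remains on video frames (token costs here are placeholders)."""
    budget = context_len - text_tokens - reserved  # text is never dropped
    return max(budget // tokens_per_frame, 0)


# Progressive mixed post-training sketch: the same data is revisited with a
# growing context window, so later stages pack in more frames/tiles per sample.
STAGES = [32_768, 65_536, 131_072]  # illustrative per-stage context lengths
for ctx in STAGES:
    frames = max_visual_frames(text_tokens=2_000, context_len=ctx)
    print(f"context={ctx}: up to {frames} frames at the assumed token cost")
```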
All numbers are directly extracted from Table 2 and Table 3 of the Eagle 2.5 Tech Report.
Citation
If you find this project useful, please cite our work:
@article{chen2025eagle2.5,
title={Eagle 2.5: Boosting Long-Context Post-Training for Frontier Vision-Language Models},
author={Chen, Guo and Li, Zhiqi and Wang, Shihao and Jiang, Jindong and Liu, Yicheng and Lu, Lidong and Huang, De-An and Byeon, Wonmin and Le, Matthieu and Ehrlich, Max and Lu, Tong and Wang, Limin and Catanzaro, Bryan and Kautz, Jan and Tao, Andrew and Yu, Zhiding and Liu, Guilin},
journal={arXiv:2504.15271},
year={2025}
}
@article{li2025eagle2buildingposttraining,
title={Eagle 2: Building Post-Training Data Strategies from Scratch for Frontier Vision-Language Models},
author={Zhiqi Li and Guo Chen and Shilong Liu and Shihao Wang and Vibashan VS and Yishen Ji and Shiyi Lan and Hao Zhang and Yilin Zhao and Subhashree Radhakrishnan and Nadine Chang and Karan Sapra and Amala Sanjay Deshmukh and Tuomas Rintamaki and Matthieu Le and Ilia Karmanov and Lukas Voegtle and Philipp Fischer and De-An Huang and Timo Roman and Tong Lu and Jose M. Alvarez and Bryan Catanzaro and Jan Kautz and Andrew Tao and Guilin Liu and Zhiding Yu},
journal={arXiv:2501.14818},
year={2025}
}
@inproceedings{shi2025eagle,
title={Eagle: Exploring The Design Space for Multimodal LLMs with Mixture of Encoders},
author={Min Shi and Fuxiao Liu and Shihao Wang and Shijia Liao and Subhashree Radhakrishnan and De-An Huang and Hongxu Yin and Karan Sapra and Yaser Yacoob and Humphrey Shi and Bryan Catanzaro and Andrew Tao and Jan Kautz and Zhiding Yu and Guilin Liu},
booktitle={ICLR},
year={2025}
}
License/Terms of Use
The code is released under the Apache 2.0 license as found in the LICENSE file.
The pretrained model weights are released under the NVIDIA License.
The service is a research preview intended for non-commercial use only, and is subject to the following licenses and terms: