Pinned Repositories
ByteTrack
[ECCV 2022] ByteTrack: Multi-Object Tracking by Associating Every Detection Box
FlashVideo
[AAAI 2026] FlashVideo: Flowing Fidelity to Detail for Efficient High-Resolution Video Generation
GLEE
[CVPR 2024 Highlight] GLEE: General Object Foundation Model for Images and Videos at Scale
Groma
[ECCV 2024] Grounded Multimodal Large Language Model with Localized Visual Tokenization
Infinity
[CVPR 2025 Oral] Infinity ∞: Scaling Bitwise AutoRegressive Modeling for High-Resolution Image Synthesis
Liquid
(Accepted by IJCV) Liquid: Language Models are Scalable and Unified Multi-modal Generators
LlamaGen
Autoregressive Model Beats Diffusion: 🦙 Llama for Scalable Image Generation
VAR
[NeurIPS 2024 Best Paper Award][GPT beats diffusion🔥] [scaling laws in visual generation📈] Official impl. of "Visual Autoregressive Modeling: Scalable Image Generation via Next-Scale Prediction". An *ultra-simple, user-friendly yet state-of-the-art* codebase for autoregressive image generation!
VNext
Next-generation video instance recognition framework built on Detectron2, supporting InstMove (CVPR 2023), SeqFormer (ECCV Oral), and IDOL (ECCV Oral)
Waver
Industry-level video foundation model for unified Text-to-Video (T2V) and Image-to-Video (I2V) generation.
FoundationVision's Repositories
FoundationVision/VAR
[NeurIPS 2024 Best Paper Award][GPT beats diffusion🔥] [scaling laws in visual generation📈] Official impl. of "Visual Autoregressive Modeling: Scalable Image Generation via Next-Scale Prediction". An *ultra-simple, user-friendly yet state-of-the-art* codebase for autoregressive image generation!
FoundationVision/ByteTrack
[ECCV 2022] ByteTrack: Multi-Object Tracking by Associating Every Detection Box
FoundationVision/LlamaGen
Autoregressive Model Beats Diffusion: 🦙 Llama for Scalable Image Generation
FoundationVision/Infinity
[CVPR 2025 Oral] Infinity ∞: Scaling Bitwise AutoRegressive Modeling for High-Resolution Image Synthesis
FoundationVision/GLEE
[CVPR 2024 Highlight] GLEE: General Object Foundation Model for Images and Videos at Scale
FoundationVision/Waver
Industry-level video foundation model for unified Text-to-Video (T2V) and Image-to-Video (I2V) generation.
FoundationVision/Liquid
(Accepted by IJCV) Liquid: Language Models are Scalable and Unified Multi-modal Generators
FoundationVision/VNext
Next-generation video instance recognition framework built on Detectron2, supporting InstMove (CVPR 2023), SeqFormer (ECCV Oral), and IDOL (ECCV Oral)
FoundationVision/Groma
[ECCV 2024] Grounded Multimodal Large Language Model with Localized Visual Tokenization
FoundationVision/FlashVideo
[AAAI 2026] FlashVideo: Flowing Fidelity to Detail for Efficient High-Resolution Video Generation
FoundationVision/UniTok
[NeurIPS 2025 Spotlight] A Unified Tokenizer for Visual Generation and Understanding
FoundationVision/OmniTokenizer
[NeurIPS 2024] OmniTokenizer: One model and one set of weights for joint image-video tokenization
FoundationVision/UniRef
[ICCV 2023] Segment Every Reference Object in Spatial and Temporal Spaces
FoundationVision/InfinityStar
[NeurIPS 2025 Oral] Infinity⭐️: Unified Spacetime AutoRegressive Modeling for Visual Generation
FoundationVision/GenerateU
[CVPR 2024] Generative Region-Language Pretraining for Open-Ended Object Detection
FoundationVision/vaex
🔥 Stable, simple, state-of-the-art VQVAE toolkit & cookbook
FoundationVision/BitVAE
Official training and inference code for the bitwise tokenizer
FoundationVision/.github
FoundationVision/flashvideo-page
FoundationVision/infinity.project