zuo1188's Stars
microsoft/TaskMatrix
haotian-liu/LLaVA
[NeurIPS'23 Oral] Visual Instruction Tuning (LLaVA) built towards GPT-4V level capabilities and beyond.
microsoft/unilm
Large-scale Self-supervised Pre-training Across Tasks, Languages, and Modalities
IDEA-Research/Grounded-Segment-Anything
Grounded SAM: Marrying Grounding DINO with Segment Anything & Stable Diffusion & Recognize Anything - Automatically Detect, Segment and Generate Anything
facebookresearch/ImageBind
ImageBind: One Embedding Space to Bind Them All
IDEA-Research/GroundingDINO
[ECCV 2024] Official implementation of the paper "Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection"
princeton-vl/infinigen
Infinite Photorealistic Worlds using Procedural Generation
amazon-science/mm-cot
Official implementation of "Multimodal Chain-of-Thought Reasoning in Language Models" (stay tuned; more updates to come)
OpenGVLab/Ask-Anything
[CVPR 2024 Highlight][VideoChatGPT] ChatGPT with video understanding, plus support for many more LMs such as MiniGPT-4, StableLM, and MOSS.
gligen/GLIGEN
Open-Set Grounded Text-to-Image Generation
microsoft/X-Decoder
[CVPR 2023] Official implementation of X-Decoder: generalized decoding for pixel, image, and language
wzzheng/TPVFormer
[CVPR 2023] An academic alternative to Tesla's occupancy network for autonomous driving.
hustvl/MapTR
[ICLR'23 Spotlight & IJCV'24] MapTR: Structured Modeling and Learning for Online Vectorized HD Map Construction
facebookresearch/home-robot
Mobile manipulation research tools for roboticists
hustvl/VAD
[ICCV 2023] VAD: Vectorized Scene Representation for Efficient Autonomous Driving
JeffWang987/OpenOccupancy
[ICCV 2023] OpenOccupancy: A Large Scale Benchmark for Surrounding Semantic Occupancy Perception
OpenDriveLab/OccNet
[ICCV 2023] OccNet: Scene as Occupancy
Vision-CAIR/ChatCaptioner
Official Repository of ChatCaptioner
DerryHub/BEVFormer_tensorrt
BEVFormer inference on TensorRT, including INT8 Quantization and Custom TensorRT Plugins (float/half/half2/int8).
OpenGVLab/Instruct2Act
Instruct2Act: Mapping Multi-modality Instructions to Robotic Actions with Large Language Model
zhangyp15/OccFormer
[ICCV 2023] OccFormer: Dual-path Transformer for Vision-based 3D Semantic Occupancy Prediction
atfortes/Awesome-Multimodal-Reasoning
Collection of papers and resources on Multimodal Reasoning, including Vision-Language Models, Multimodal Chain-of-Thought, Visual Inference, and others.
autonomousvision/nuplan_garage
[arXiv'23] Parting with Misconceptions about Learning-based Vehicle Motion Planning
PrieureDeSion/drive-any-robot
Official code and checkpoint release for "GNM: A General Navigation Model to Drive Any Robot".
Tsinghua-MARS-Lab/ViP3D
JonDoe-297/cross-view
[CVPR'21] Projecting Your View Attentively: Monocular Road Scene Layout Estimation via Cross-view Transformation
JinkyuKimUCB/BDD-X-dataset
Berkeley Deep Drive-X (eXplanation) dataset
PrieureDeSion/visualnav-transformer
Official code and checkpoint release for "ViNT: A Foundation Model for Visual Navigation".
Vision-CAIR/3DCoMPaT-v2
3DCoMPaT++: An improved large-scale 3D vision dataset for compositional recognition
tomguluson92/SCAT
SCAT: Stride Consistency with Auto-regressive regressor and Transformer for hand pose estimation (ICCVW 2021)