MASKlll's Stars
real-stanford/scalingup
[CoRL 2023] This repository contains data generation and training code for Scaling Up & Distilling Down
clorislili/ManipLLM
The official codebase for ManipLLM: Embodied Multimodal Large Language Model for Object-Centric Robotic Manipulation (CVPR 2024)
GeWu-Lab/DepthHelps-IROS2024
UMass-Foundation-Model/3D-VLA
[ICML 2024] 3D-VLA: A 3D Vision-Language-Action Generative World Model
bytedance/GR-MG
Official implementation of GR-MG
hkchengrex/Cutie
[CVPR 2024 Highlight] Putting the Object Back Into Video Object Segmentation
huangwl18/ReKep
ReKep: Spatio-Temporal Reasoning of Relational Keypoint Constraints for Robotic Manipulation
openvla/openvla
OpenVLA: An open-source vision-language-action model for robotic manipulation.
QwenLM/Qwen2-VL
Qwen2-VL is the multimodal large language model series developed by the Qwen team at Alibaba Cloud.
cfeng16/UniTouch
Binding Touch to Everything: Learning Unified Multimodal Tactile Representations
mkt1412/GraspGPT_public
Code implementation of GraspGPT and FoundationGrasp
BAAI-DCAI/Bunny
A family of lightweight multimodal models.
YvanYin/Metric3D
The repo for "Metric3D: Towards Zero-shot Metric 3D Prediction from A Single Image" and "Metric3Dv2: A Versatile Monocular Geometric Foundation Model..."
isl-org/ZoeDepth
Metric depth estimation from a single image
remyxai/VQASynth
Compose multimodal datasets 🎹
epic-kitchens/epic-kitchens-100-annotations
🍽️ Annotations for the public release of the EPIC-KITCHENS-100 dataset
bdaiinstitute/theia
Theia: Distilling Diverse Vision Foundation Models for Robot Learning
BAAI-DCAI/SpatialBot
The official repo for "SpatialBot: Precise Spatial Understanding with Vision Language Models.
graspnet/anygrasp_sdk
rail-berkeley/fmb
HCPLab-SYSU/Embodied_AI_Paper_List
[Embodied-AI-Survey-2024] Paper list and projects for Embodied AI
lucidrains/vit-pytorch
Implementation of Vision Transformer, a simple way to achieve SOTA in vision classification with only a single transformer encoder, in PyTorch
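A minimal usage sketch following the pattern in the vit-pytorch README; the model hyperparameters and input size below are illustrative placeholders, not a recommended configuration:

```python
import torch
from vit_pytorch import ViT

# instantiate a small ViT; all dimensions below are illustrative placeholders
model = ViT(
    image_size=256,
    patch_size=32,
    num_classes=1000,
    dim=1024,
    depth=6,
    heads=16,
    mlp_dim=2048,
)

img = torch.randn(1, 3, 256, 256)  # dummy batch of one RGB image
logits = model(img)                # shape: (1, 1000)
```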
openai/CLIP
CLIP (Contrastive Language-Image Pretraining): predict the most relevant text snippet given an image
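A minimal zero-shot matching sketch, assuming the `clip` package from this repo is installed; the image path and candidate captions are hypothetical placeholders:

```python
import clip
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# "cat.png" is a placeholder path; the captions are arbitrary candidates
image = preprocess(Image.open("cat.png")).unsqueeze(0).to(device)
text = clip.tokenize(["a photo of a cat", "a photo of a dog"]).to(device)

with torch.no_grad():
    logits_per_image, _ = model(image, text)
    probs = logits_per_image.softmax(dim=-1)  # probability of each caption matching the image
```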
facebookresearch/Ego4d
Ego4D dataset repository: download the dataset, visualize it, extract features, and see example usage of the dataset
vimalabs/VIMABench
Official Task Suite Implementation of ICML'23 Paper "VIMA: General Robot Manipulation with Multimodal Prompts"
intuitive-robots/mdt_policy
[RSS 2024] Code for "Multimodal Diffusion Transformer: Learning Versatile Behavior from Multimodal Goals" for CALVIN experiments with pre-trained weights
DepthAnything/Depth-Anything-V2
Depth Anything V2. A More Capable Foundation Model for Monocular Depth Estimation
huggingface/pytorch-image-models
The largest collection of PyTorch image encoders / backbones. Including train, eval, inference, export scripts, and pretrained weights -- ResNet, ResNeXT, EfficientNet, NFNet, Vision Transformer (ViT), MobileNetV4, MobileNet-V3 & V2, RegNet, DPN, CSPNet, Swin Transformer, MaxViT, CoAtNet, ConvNeXt, and more
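A minimal sketch of loading a pretrained backbone with timm; "resnet50" is just one example name, and any entry from `timm.list_models(pretrained=True)` can be substituted:

```python
import timm
import torch

# create a pretrained ImageNet classifier by name
model = timm.create_model("resnet50", pretrained=True)
model.eval()

x = torch.randn(1, 3, 224, 224)  # dummy input batch
with torch.no_grad():
    logits = model(x)  # shape: (1, 1000) ImageNet class logits
```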
LostXine/LLaRA
LLaRA: Large Language and Robotics Assistant
changhaonan/A3VLM
[CoRL 2024] Official repo of `A3VLM: Actionable Articulation-Aware Vision Language Model`