LibertFan's Stars
huggingface/lerobot
🤗 LeRobot: Making AI for Robotics more accessible with end-to-end learning
LargeWorldModel/LWM
Large World Model -- Modeling Text and Video with Million-Length Context
xlang-ai/OSWorld
[NeurIPS 2024] OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments
THUDM/AgentTuning
AgentTuning: Enabling Generalized Agent Abilities for LLMs
ShareGPT4Omni/ShareGPT4Video
[NeurIPS 2024] An official implementation of ShareGPT4Video: Improving Video Understanding and Generation with Better Captions
yaodongC/awesome-instruction-dataset
A collection of open-source datasets for training instruction-following LLMs (ChatGPT, LLaMA, Alpaca)
wilson1yan/VideoGPT
robocasa/robocasa
RoboCasa: Large-Scale Simulation of Everyday Tasks for Generalist Robots
LAION-AI/aesthetic-predictor
A linear estimator on top of CLIP to predict the aesthetic quality of pictures
Vision-CAIR/ChatCaptioner
Official Repository of ChatCaptioner
WooooDyy/AgentGym
Code and implementations for the paper "AgentGym: Evolving Large Language Model-based Agents across Diverse Environments" by Zhiheng Xi et al.
njucckevin/SeeClick
The model, data, and code for the visual GUI agent SeeClick
allenai/WildBench
Benchmarking LLMs with Challenging Tasks from Real Users
victorsungo/MMDialog
The official site of the paper "MMDialog: A Large-scale Multi-turn Dialogue Dataset Towards Multi-modal Open-domain Conversation"
PKU-EPIC/DexGraspNet
XiaoxiaoGuo/fashion-iq
google-research/android_world
AndroidWorld is an environment and benchmark for autonomous agents
Yushi-Hu/VisualSketchpad
Code for "Visual Sketchpad: Sketching as a Visual Chain of Thought for Multimodal Language Models"
alipay/Ant-Multi-Modal-Framework
Research code from the Multimodal-Cognition team at Ant Group
yjy0625/equibot
Official implementation of the paper "EquiBot: SIM(3)-Equivariant Diffusion Policy for Generalizable and Data Efficient Learning"
LibertFan/AI_Hospital
AI Hospital: Interactive Evaluation and Collaboration of LLMs as Intern Doctors for Clinical Diagnosis
OpenGVLab/MMT-Bench
[ICML 2024] MMT-Bench: A Comprehensive Multimodal Benchmark for Evaluating Large Vision-Language Models Towards Multitask AGI
yiye3/GUICourse
GUICourse: From General Vision Language Models to Versatile GUI Agents
princeton-nlp/CharXiv
[NeurIPS 2024] CharXiv: Charting Gaps in Realistic Chart Understanding in Multimodal LLMs
MILVLG/activitynet-qa
A VideoQA dataset based on videos from ActivityNet
prometheus-eval/prometheus-vision
[ACL 2024 Findings & ICLR 2024 WS] An evaluator VLM that is open-source, offers reproducible evaluation, and is inexpensive to use. Specifically designed for fine-grained evaluation on customized score rubrics, Prometheus-Vision is a good alternative to human evaluation and GPT-4V evaluation.
google-research-datasets/screen_annotation
The Screen Annotation dataset consists of pairs of mobile screenshots and their annotations. The annotations are in text format and describe the UI elements present on the screen: their type, location, OCR text, and a short description. It was introduced in the paper `ScreenAI: A Vision-Language Model for UI and Infographics Understanding`.
KwanWaiChung/MT-Eval
Code and data for "MT-Eval: A Multi-Turn Capabilities Evaluation Benchmark for Large Language Models"
chuyg1005/seeclick-crawler
SkyworkAI/agent-studio
Environments, tools, and benchmarks for general computer agents