Recent LLM (Large Language Model)-based CV and multi-modal works. Comments and contributions are welcome!
- (arXiv 2023.9) ImageBind-LLM: Multi-modality Instruction Tuning, [Paper], [Code]
- (arXiv 2023.9) Developmental Scaffolding with Large Language Models, [Paper]
- (arXiv 2023.9) Gesture-Informed Robot Assistance via Foundation Models, [Paper], [Project]
- (arXiv 2023.9) Zero-Shot Recommendations with Pre-Trained Large Language Models for Multimodal Nudging, [Paper]
- (arXiv 2023.9) Large AI Model Empowered Multimodal Semantic Communications, [Paper]
- (arXiv 2023.9) CoTDet: Affordance Knowledge Prompting for Task Driven Object Detection, [Paper], [Project]
- (arXiv 2023.9) Scaling Autoregressive Multi-Modal Models: Pretraining and Instruction Tuning, [Paper]
- (arXiv 2023.9) CIEM: Contrastive Instruction Evaluation Method for Better Instruction Tuning, [Paper]
- (arXiv 2023.9) Point-Bind & Point-LLM: Aligning Point Cloud with Multi-modality for 3D Understanding, Generation, and Instruction Following, [Paper], [Code]
- (arXiv 2023.8) Improving Knowledge Extraction from LLMs for Task Learning through Agent Analysis, [Paper]
- (arXiv 2023.8) Sparkles: Unlocking Chats Across Multiple Images for Multimodal Instruction-Following Models, [Paper], [Code]
- (arXiv 2023.8) PointLLM: Empowering Large Language Models to Understand Point Clouds, [Paper], [Project]
- (arXiv 2023.8) TouchStone: Evaluating Vision-Language Models by Language Models, [Paper], [Code]
- (arXiv 2023.8) WALL-E: Embodied Robotic WAiter Load Lifting with Large Language Model, [Paper]
- (arXiv 2023.8) ISR-LLM: Iterative Self-Refined Large Language Model for Long-Horizon Sequential Task Planning, [Paper], [Code]
- (arXiv 2023.8) LLM-Based Human-Robot Collaboration Framework for Manipulation Tasks, [Paper]
- (arXiv 2023.8) Evaluation and Analysis of Hallucination in Large Vision-Language Models, [Paper]
- (arXiv 2023.8) MLLM-DataEngine: An Iterative Refinement Approach for MLLM, [Paper]
- (arXiv 2023.8) Position-Enhanced Visual Instruction Tuning for Multimodal Large Language Models, [Paper]
- (arXiv 2023.8) Can Linguistic Knowledge Improve Multimodal Alignment in Vision-Language Pretraining? [Paper], [Code]
- (arXiv 2023.8) VIGC: Visual Instruction Generation and Correction, [Paper]
- (arXiv 2023.8) Towards Realistic Zero-Shot Classification via Self Structural Semantic Alignment, [Paper]
- (arXiv 2023.8) Qwen-VL: A Frontier Large Vision-Language Model with Versatile Abilities, [Paper], [Code]
- (arXiv 2023.8) Diffusion Language Models Can Perform Many Tasks with Scaling and Instruction-Finetuning, [Paper], [Code]
- (arXiv 2023.8) CHORUS: Learning Canonicalized 3D Human-Object Spatial Relations from Unbounded Synthesized Images, [Paper], [Project]
- (arXiv 2023.8) ProAgent: Building Proactive Cooperative AI with Large Language Models, [Paper], [Project]
- (arXiv 2023.8) ROSGPT_Vision: Commanding Robots Using Only Language Models’ Prompts, [Paper], [Code]
- (arXiv 2023.8) StoryBench: A Multifaceted Benchmark for Continuous Story Visualization, [Paper], [Code]
- (arXiv 2023.8) Tackling Vision Language Tasks Through Learning Inner Monologues, [Paper]
- (arXiv 2023.8) ExpeL: LLM Agents Are Experiential Learners, [Paper]
- (arXiv 2023.8) On the Adversarial Robustness of Multi-Modal Foundation Models, [Paper]
- (arXiv 2023.8) WanJuan: A Comprehensive Multimodal Dataset for Advancing English and Chinese Large Models, [Paper], [Project]
- (arXiv 2023.8) March in Chat: Interactive Prompting for Remote Embodied Referring Expression, [Paper], [Code]
- (arXiv 2023.8) BLIVA: A Simple Multimodal LLM for Better Handling of Text-Rich Visual Questions, [Paper], [Code]
- (arXiv 2023.8) ViT-Lens: Towards Omni-modal Representations, [Paper], [Code]
- (arXiv 2023.8) StableLLaVA: Enhanced Visual Instruction Tuning with Synthesized Image-Dialogue Data, [Paper], [Project]
- (arXiv 2023.8) PUMGPT: A Large Vision-Language Model for Product Understanding, [Paper]
- (arXiv 2023.8) Link-Context Learning for Multimodal LLMs, [Paper], [Code]
- (arXiv 2023.8) Detecting and Preventing Hallucinations in Large Vision Language Models, [Paper]
- (arXiv 2023.8) VisIT-Bench: A Benchmark for Vision-Language Instruction Following Inspired by Real-World Use, [Paper], [Project]
- (arXiv 2023.8) Foundation Model based Open Vocabulary Task Planning and Executive System for General Purpose Service Robots, [Paper]
- (arXiv 2023.8) LayoutLLM-T2I: Eliciting Layout Guidance from LLM for Text-to-Image Generation, [Paper], [Project]
- (arXiv 2023.8) OmniDataComposer: A Unified Data Structure for Multimodal Data Fusion and Infinite Data Generation, [Paper]
- (arXiv 2023.8) Empowering Vision-Language Models to Follow Interleaved Vision-Language Instructions, [Paper], [Code]
- (arXiv 2023.8) 3D-VisTA: Pre-trained Transformer for 3D Vision and Text Alignment, [Paper], [Project]
- (arXiv 2023.8) Gentopia.AI: A Collaborative Platform for Tool-Augmented LLMs, [Paper], [Project]
- (arXiv 2023.8) AgentBench: Evaluating LLMs as Agents, [Paper], [Project]
- (arXiv 2023.8) Learning Concise and Descriptive Attributes for Visual Recognition, [Paper]
- (arXiv 2023.8) Tiny LVLM-eHub: Early Multimodal Experiments with Bard, [Paper], [Project]
- (arXiv 2023.8) MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities, [Paper], [Code]
- (arXiv 2023.8) RegionBLIP: A Unified Multi-modal Pre-training Framework for Holistic and Regional Comprehension, [Paper], [Code]
- (arXiv 2023.8) Learning to Model the World with Language, [Paper], [Project]
- (arXiv 2023.8) The All-Seeing Project: Towards Panoptic Visual Recognition and Understanding of the Open World, [Paper], [Code]
- (arXiv 2023.8) Multimodal Neurons in Pretrained Text-Only Transformers, [Paper], [Project]
- (arXiv 2023.8) LISA: Reasoning Segmentation via Large Language Model, [Paper], [Code]
- (arXiv 2023.7) DesCo: Learning Object Recognition with Rich Language Descriptions, [Paper]
- (arXiv 2023.7) KOSMOS-2: Grounding Multimodal Large Language Models to the World, [Paper], [Project]
- (arXiv 2023.7) MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models, [Paper], [Code]
- (arXiv 2023.7) Evaluating ChatGPT and GPT-4 for Visual Programming, [Paper]
- (arXiv 2023.7) SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension, [Paper], [Code]
- (arXiv 2023.7) AntGPT: Can Large Language Models Help Long-term Action Anticipation from Videos? [Paper], [Project]
- (arXiv 2023.7) Bridging the Gap: Exploring the Capabilities of Bridge-Architectures for Complex Visual Reasoning Tasks, [Paper]
- (arXiv 2023.7) MovieChat: From Dense Token to Sparse Memory for Long Video Understanding, [Paper], [Project]
- (arXiv 2023.7) Large Language Models as General Pattern Machines, [Paper], [Project]
- (arXiv 2023.7) How Good is Google Bard’s Visual Understanding? An Empirical Study on Open Challenges, [Paper], [Project]
- (arXiv 2023.7) RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control, [Paper], [Project]
- (arXiv 2023.7) Scaling Up and Distilling Down: Language-Guided Robot Skill Acquisition, [Paper], [Project]
- (arXiv 2023.7) GraspGPT: Leveraging Semantic Knowledge from a Large Language Model for Task-Oriented Grasping, [Paper], [Project]
- (arXiv 2023.7) CARTIER: Cartographic lAnguage Reasoning Targeted at Instruction Execution for Robots, [Paper]
- (arXiv 2023.7) 3D-LLM: Injecting the 3D World into Large Language Models, [Paper], [Project]
- (arXiv 2023.7) Generative Pretraining in Multimodality, [Paper], [Code]
- (arXiv 2023.7) VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models, [Paper], [Project]
- (arXiv 2023.7) VELMA: Verbalization Embodiment of LLM Agents for Vision and Language Navigation in Street View, [Paper]
- (arXiv 2023.7) SayPlan: Grounding Large Language Models using 3D Scene Graphs for Scalable Task Planning, [Paper], [Project]
- (arXiv 2023.7) Enhancing CLIP with GPT-4: Harnessing Visual Descriptions as Prompts, [Paper]
- (arXiv 2023.7) InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation, [Paper], [Data]
- (arXiv 2023.7) mBLIP: Efficient Bootstrapping of Multilingual Vision-LLMs, [Paper], [Code]
- (arXiv 2023.7) Bootstrapping Vision-Language Learning with Decoupled Language Pre-training, [Paper]
- (arXiv 2023.7) BuboGPT: Enabling Visual Grounding in Multi-Modal LLMs, [Paper], [Project]
- (arXiv 2023.7) ChatSpot: Bootstrapping Multimodal LLMs via Precise Referring Instruction Tuning, [Paper], [Project]
- (arXiv 2023.7) Towards a Unified Agent with Foundation Models, [Paper]
- (arXiv 2023.7) Robots That Ask For Help: Uncertainty Alignment for Large Language Model Planners, [Paper], [Project]
- (arXiv 2023.7) Building Cooperative Embodied Agents Modularly with Large Language Models, [Paper], [Project]
- (arXiv 2023.7) Embodied Task Planning with Large Language Models, [Paper], [Project]
- (arXiv 2023.7) What Matters in Training a GPT4-Style Language Model with Multimodal Inputs?, [Paper], [Project]
- (arXiv 2023.7) GPT4RoI: Instruction Tuning Large Language Model on Region-of-Interest, [Paper], [Code]
- (arXiv 2023.7) JourneyDB: A Benchmark for Generative Image Understanding, [Paper], [Code]
- (arXiv 2023.7) DoReMi: Grounding Language Model by Detecting and Recovering from Plan-Execution Misalignment, [Paper], [Project]
- (arXiv 2023.7) Motion-X: A Large-scale 3D Expressive Whole-body Human Motion Dataset, [Paper], [Code]
- (arXiv 2023.7) Visual Instruction Tuning with Polite Flamingo, [Paper], [Code]
- (arXiv 2023.7) Statler: State-Maintaining Language Models for Embodied Reasoning, [Paper], [Project]
- (arXiv 2023.7) SciTune: Aligning Large Language Models with Scientific Multimodal Instructions, [Paper]
- (arXiv 2023.7) SPAE: Semantic Pyramid AutoEncoder for Multimodal Generation with Frozen LLMs, [Paper], [Code]
- (arXiv 2023.7) KITE: Keypoint-Conditioned Policies for Semantic Manipulation, [Paper], [Project]
- (arXiv 2023.6) LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and Benchmark, [Paper], [Code]
- (arXiv 2023.6) Scalable 3D Captioning with Pretrained Models, [Paper], [Code]
- (arXiv 2023.6) AutoTAMP: Autoregressive Task and Motion Planning with LLMs as Translators and Checkers, [Paper], [Code]
- (arXiv 2023.6) Valley: Video Assistant with Large Language Model Enhanced Ability, [Paper], [Code]
- (arXiv 2023.6) Pave the Way to Grasp Anything: Transferring Foundation Models for Universal Pick-Place Robots, [Paper]
- (arXiv 2023.6) LVLM-eHub: A Comprehensive Evaluation Benchmark for Large Vision-Language Models, [Paper]
- (arXiv 2023.6) AssistGPT: A General Multi-modal Assistant that can Plan, Execute, Inspect, and Learn, [Paper], [Project]
- (arXiv 2023.6) Towards AGI in Computer Vision: Lessons Learned from GPT and Large Language Models, [Paper]
- (arXiv 2023.6) Macaw-LLM: Multi-Modal Language Modeling with Image, Audio, Video, and Text Integration, [Paper], [Code]
- (arXiv 2023.6) Investigating Prompting Techniques for Zero- and Few-Shot Visual Question Answering, [Paper]
- (arXiv 2023.6) Language to Rewards for Robotic Skill Synthesis, [Paper], [Project]
- (arXiv 2023.6) Toward Grounded Social Reasoning, [Paper], [Code]
- (arXiv 2023.6) Improving Image Captioning Descriptiveness by Ranking and LLM-based Fusion, [Paper], [Code]
- (arXiv 2023.6) RM-PRT: Realistic Robotic Manipulation Simulator and Benchmark with Progressive Reasoning Tasks, [Paper], [Code]
- (arXiv 2023.6) Aligning Large Multi-Modal Model with Robust Instruction Tuning, [Paper], [Project]
- (arXiv 2023.6) Towards Language Models That Can See: Computer Vision Through the LENS of Natural Language, [Paper], [Code]
- (arXiv 2023.6) LLaVAR: Enhanced Visual Instruction Tuning for Text-Rich Image Understanding, [Paper], [Project]
- (arXiv 2023.6) OpenShape: Scaling Up 3D Shape Representation Towards Open-World Understanding, [Paper], [Project]
- (arXiv 2023.6) Statler: State-Maintaining Language Models for Embodied Reasoning, [Paper], [Project]
- (arXiv 2023.6) CLARA: Classifying and Disambiguating User Commands for Reliable Interactive Robotic Agents, [Paper]
- (arXiv 2023.6) Mass-Producing Failures of Multimodal Systems with Language Models, [Paper], [Code]
- (arXiv 2023.6) SoftGPT: Learn Goal-oriented Soft Object Manipulation Skills by Generative Pre-trained Heterogeneous Graph Transformer, [Paper]
- (arXiv 2023.6) SPRINT: Scalable Policy Pre-Training via Language Instruction Relabeling, [Paper], [Project]
- (arXiv 2023.6) MotionGPT: Finetuned LLMs are General-Purpose Motion Generators, [Paper], [Project]
- (arXiv 2023.6) MIMIC-IT: Multi-Modal In-Context Instruction Tuning, [Paper], [Code]
- (arXiv 2023.6) Dense and Aligned Captions (DAC) Promote Compositional Reasoning in VL Models, [Paper]
- (arXiv 2023.5) Prompting Large Language Models with Answer Heuristics for Knowledge-based Visual Question Answering, [Paper], [Code]
- (arXiv 2023.5) VIMA: General Robot Manipulation with Multimodal Prompts, [Paper], [Project]
- (arXiv 2023.5) TidyBot: Personalized Robot Assistance with Large Language Models, [Paper], [Project]
- (arXiv 2023.5) Training Diffusion Models with Reinforcement Learning, [Paper], [Project]
- (arXiv 2023.5) EmbodiedGPT: Vision-Language Pre-Training via Embodied Chain of Thought, [Paper], [Project]
- (arXiv 2023.5) ArtGPT-4: Artistic Vision-Language Understanding with Adapter-enhanced MiniGPT-4, [Paper], [Code]
- (arXiv 2023.5) Evaluating Object Hallucination in Large Vision-Language Models, [Paper], [Code]
- (arXiv 2023.5) LLMScore: Unveiling the Power of Large Language Models in Text-to-Image Synthesis Evaluation, [Paper], [Code]
- (arXiv 2023.5) VisionLLM: Large Language Model is also an Open-Ended Decoder for Vision-Centric Tasks, [Paper], [Code]
- (arXiv 2023.5) OpenShape: Scaling Up 3D Shape Representation Towards Open-World Understanding, [Paper], [Project]
- (arXiv 2023.5) Towards A Foundation Model for Generalist Robots: Diverse Skill Learning at Scale via Automated Task and Scene Generation, [Paper]
- (arXiv 2023.5) An Android Robot Head as Embodied Conversational Agent, [Paper]
- (arXiv 2023.5) Instruct2Act: Mapping Multi-modality Instructions to Robotic Actions with Large Language Model, [Paper], [Code]
- (arXiv 2023.5) Principle-Driven Self-Alignment of Language Models from Scratch with Minimal Human Supervision, [Paper], [Project]
- (arXiv 2023.5) Multimodal Procedural Planning via Dual Text-Image Prompting, [Paper], [Code]
- (arXiv 2023.5) ArK: Augmented Reality with Knowledge Interactive Emergent Ability, [Paper]
- (arXiv 2023.4) LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model, [Paper], [Code]
- (arXiv 2023.4) Multimodal Grounding for Embodied AI via Augmented Reality Headsets for Natural Language Driven Task Planning, [Paper]
- (arXiv 2023.4) mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality, [Paper], [Code]
- (arXiv 2023.4) ChatVideo: A Tracklet-centric Multimodal and Versatile Video Understanding System, [Paper], [Project]
- (arXiv 2023.4) ChatABL: Abductive Learning via Natural Language Interaction with ChatGPT, [Paper]
- (arXiv 2023.4) Robot-Enabled Construction Assembly with Automated Sequence Planning based on ChatGPT: RoboGPT, [Paper]
- (arXiv 2023.4) Graph-ToolFormer: To Empower LLMs with Graph Reasoning Ability via Prompt Augmented by ChatGPT, [Paper], [Code]
- (arXiv 2023.4) Can GPT-4 Perform Neural Architecture Search?, [Paper], [Code]
- (arXiv 2023.4) MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models, [Paper], [Project]
- (arXiv 2023.4) SINC: Spatial Composition of 3D Human Motions for Simultaneous Action Generation, [Paper], [Project]
- (arXiv 2023.4) LLM as A Robotic Brain: Unifying Egocentric Memory and Control, [Paper]
- (arXiv 2023.4) Chameleon: Plug-and-Play Compositional Reasoning with Large Language Models, [Paper], [Project]
- (arXiv 2023.4) Visual Instruction Tuning, [Paper], [Project]
- (arXiv 2023.4) RAFT: Reward rAnked FineTuning for Generative Foundation Model Alignment, [Paper], [Code]
- (arXiv 2023.4) Multimodal C4: An Open, Billion-scale Corpus of Images Interleaved With Text, [Paper], [Code]
- (arXiv 2023.4) ViewRefer: Grasp the Multi-view Knowledge for 3D Visual Grounding with GPT and Prototype Guidance, [Paper], [Code]
- (arXiv 2023.4) HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in Hugging Face, [Paper], [Code]
- (arXiv 2023.4) ERRA: An Embodied Representation and Reasoning Architecture for Long-horizon Language-conditioned Manipulation Tasks, [Paper], [Code]
- (arXiv 2023.4) Advancing Medical Imaging with Language Models: A Journey from N-grams to ChatGPT, [Paper]
- (arXiv 2023.4) ChatGPT Empowered Long-Step Robot Control in Various Environments: A Case Application, [Paper], [Code]
- (arXiv 2023.4) OpenAGI: When LLM Meets Domain Experts, [Paper], [Code]
- (arXiv 2023.4) Video ChatCaptioner: Towards the Enriched Spatiotemporal Descriptions, [Paper], [Code]
- (arXiv 2023.3) Open-World Object Manipulation using Pre-Trained Vision-Language Models, [Paper], [Project]
- (arXiv 2023.3) Grounded Decoding: Guiding Text Generation with Grounded Models for Robot Control, [Paper], [Project]
- (arXiv 2023.3) Task and Motion Planning with Large Language Models for Object Rearrangement, [Paper], [Project]
- (arXiv 2023.3) RE-MOVE: An Adaptive Policy Design Approach for Dynamic Environments via Language-Based Feedback, [Paper], [Project]
- (arXiv 2023.3) Chat with the Environment: Interactive Multimodal Perception using Large Language Models, [Paper]
- (arXiv 2023.3) MAtch, eXpand and Improve: Unsupervised Finetuning for Zero-Shot Action Recognition with Language Knowledge, [Paper], [Code]
- (arXiv 2023.3) DialogPaint: A Dialog-based Image Editing Model, [Paper]
- (arXiv 2023.3) MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action, [Paper], [Project]
- (arXiv 2023.3) eP-ALM: Efficient Perceptual Augmentation of Language Models, [Paper], [Code]
- (arXiv 2023.3) Errors are Useful Prompts: Instruction Guided Task Programming with Verifier-Assisted Iterative Prompting, [Paper], [Project]
- (arXiv 2023.3) LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention, [Paper], [Code]
- (arXiv 2023.3) Multimodal Analogical Reasoning over Knowledge Graphs, [Paper], [Code]
- (arXiv 2023.3) Can Large Language Models Design a Robot? [Paper]
- (arXiv 2023.3) Learning video embedding space with Natural Language Supervision, [Paper]
- (arXiv 2023.3) Audio Visual Language Maps for Robot Navigation, [Paper], [Project]
- (arXiv 2023.3) ViperGPT: Visual Inference via Python Execution for Reasoning, [Paper]
- (arXiv 2023.3) ChatGPT Asks, BLIP-2 Answers: Automatic Questioning Towards Enriched Visual Descriptions, [Paper], [Code]
- (arXiv 2023.3) Can an Embodied Agent Find Your “Cat-shaped Mug”? LLM-Based Zero-Shot Object Navigation, [Paper], [Project]
- (arXiv 2023.3) Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models, [Paper], [Code]
- (arXiv 2023.3) PaLM-E: An Embodied Multimodal Language Model, [Paper], [Project]
- (arXiv 2023.3) Language Is Not All You Need: Aligning Perception with Language Models, [Paper], [Code]
- (arXiv 2023.2) ChatGPT for Robotics: Design Principles and Model Abilities, [Paper], [Code]
- (arXiv 2023.2) Internet Explorer: Targeted Representation Learning on the Open Web, [Paper], [Project]