arxiv-daily

Automated deployment @ 2024-10-22 09:44:43

Add your topics and keywords in database/topic.yml. You can also view historical data in database/storage.

Multimodal

Weakly Supervised Grounding

| Publish Date | Title | Authors | PDF | Code |
|---|---|---|---|---|
| 2024-02-29 | How to Understand "Support"? An Implicit-enhanced Causal Inference Approach for Weakly-supervised Phrase Grounding | Jiamin Luo et al. | 2402.19116v2 | null |
| 2024-01-19 | Weakly Supervised Gaussian Contrastive Grounding with Large Multimodal Models for Video Question Answering | Haibo Wang et al. | 2401.10711v4 | link |
| 2023-12-15 | Weakly-Supervised 3D Visual Grounding based on Visual Linguistic Alignment | Xiaoxu Xu et al. | 2312.09625v3 | null |
| 2023-12-07 | Improved Visual Grounding through Self-Consistent Explanations | Ruozhen He et al. | 2312.04554v1 | null |
| 2023-05-18 | Weakly-Supervised Visual-Textual Grounding with Semantic Prior Refinement | Davide Rigoni et al. | 2305.10913v2 | link |
| 2023-03-31 | Zero-shot Referring Image Segmentation with Global-Local Context Features | Seonghoon Yu et al. | 2303.17811v2 | link |
| 2022-10-09 | MAMO: Masked Multimodal Modeling for Fine-Grained Vision-Language Representation Learning | Zijia Zhao et al. | 2210.04183v3 | null |
| 2022-06-14 | Beyond Grounding: Extracting Fine-Grained Event Hierarchies Across Modalities | Hammad A. Ayyubi et al. | 2206.07207v3 | null |
| 2022-04-22 | Hypergraph Transformer: Weakly-supervised Multi-hop Reasoning for Knowledge-based Visual Question Answering | Yu-Jung Heo et al. | 2204.10448v1 | link |
| 2022-03-16 | Pseudo-Q: Generating Pseudo Language Queries for Visual Grounding | Haojun Jiang et al. | 2203.08481v2 | link |
| 2022-02-09 | Can Open Domain Question Answering Systems Answer Visual Knowledge Questions? | Jiawen Zhang et al. | 2202.04306v1 | null |
| 2021-12-01 | Weakly-Supervised Video Object Grounding via Causal Intervention | Wei Wang et al. | 2112.00475v1 | null |
| 2021-09-04 | Weakly Supervised Relative Spatial Reasoning for Visual Question Answering | Pratyay Banerjee et al. | 2109.01934v1 | null |
| 2020-10-12 | MAF: Multimodal Alignment Framework for Weakly-Supervised Phrase Grounding | Qinxin Wang et al. | 2010.05379v1 | link |
| 2020-06-17 | Contrastive Learning for Weakly Supervised Phrase Grounding | Tanmay Gupta et al. | 2006.09920v3 | link |
| 2019-12-01 | Learning to Relate from Captions and Bounding Boxes | Sarthak Garg et al. | 1912.00311v1 | null |
| 2019-08-29 | Aesthetic Image Captioning From Weakly-Labelled Photographs | Koustav Ghosal et al. | 1908.11310v1 | null |

Alignment

| Publish Date | Title | Authors | PDF | Code |
|---|---|---|---|---|
| 2024-10-18 | MiCEval: Unveiling Multimodal Chain of Thought's Quality via Image Description and Reasoning Steps | Xiongtao Zhou et al. | 2410.14668v1 | link |
| 2024-10-18 | Few-Shot Joint Multimodal Entity-Relation Extraction via Knowledge-Enhanced Cross-modal Prompt Model | Li Yuan et al. | 2410.14225v1 | null |
| 2024-10-17 | Exploring the Design Space of Visual Context Representation in Video MLLMs | Yifan Du et al. | 2410.13694v1 | link |
| 2024-10-17 | Remember, Retrieve and Generate: Understanding Infinite Visual Concepts as Your Personalized Assistant | Haoran Hao et al. | 2410.13360v1 | link |
| 2024-10-17 | CLaMP 2: Multimodal Music Information Retrieval Across 101 Languages Using Large Language Models | Shangda Wu et al. | 2410.13267v1 | link |
| 2024-10-16 | MMed-RAG: Versatile Multimodal RAG System for Medical Vision Language Models | Peng Xia et al. | 2410.13085v1 | link |
| 2024-10-15 | OMCAT: Omni Context Aware Transformer | Arushi Goel et al. | 2410.12109v1 | null |
| 2024-10-14 | MMCFND: Multimodal Multilingual Caption-aware Fake News Detection for Low-resource Indic Languages | Shubhi Bansal et al. | 2410.10407v1 | null |
| 2024-10-11 | Baichuan-Omni Technical Report | Yadong Li et al. | 2410.08565v1 | link |
| 2024-10-10 | InstructBioMol: Advancing Biomolecule Understanding and Design Following Human Instructions | Xiang Zhuang et al. | 2410.07919v1 | null |
| 2024-10-09 | Do better language models have crisper vision? | Jona Ruthardt et al. | 2410.07173v1 | null |
| 2024-10-09 | ETA: Evaluating Then Aligning Safety of Vision Language Models at Inference Time | Yi Ding et al. | 2410.06625v1 | link |
| 2024-10-08 | LLaCA: Multimodal Large Language Continual Assistant | Jingyang Qiao et al. | 2410.10868v1 | null |
| 2024-10-08 | Multimodal Large Language Models and Tunings: Vision, Language, Sensors, Audio, and Beyond | Soyeon Caren Han et al. | 2410.05608v1 | null |
| 2024-10-03 | LLaVA-Critic: Learning to Evaluate Multimodal Models | Tianyi Xiong et al. | 2410.02712v1 | null |
| 2024-10-03 | From Pixels to Tokens: Byte-Pair Encoding on Quantized Visual Modalities | Wanpeng Zhang et al. | 2410.02155v2 | null |
| 2024-09-30 | The age of spiritual machines: Language quietus induces synthetic altered states of consciousness in artificial intelligence | Jeremy I Skipper et al. | 2410.00257v1 | null |
| 2024-09-30 | Multimodal LLM Enhanced Cross-lingual Cross-modal Retrieval | Yabing Wang et al. | 2409.19961v1 | null |
| 2024-09-29 | A multimodal LLM for the non-invasive decoding of spoken text from brain recordings | Youssef Hmamouche et al. | 2409.19710v1 | null |
| 2024-09-27 | Show and Guide: Instructional-Plan Grounded Vision and Language Model | Diogo Glória-Silva et al. | 2409.19074v3 | link |
| 2024-09-27 | CLLMate: A Multimodal LLM for Weather and Climate Events Forecasting | Haobo Li et al. | 2409.19058v1 | null |
| 2024-09-26 | MMMT-IF: A Challenging Multimodal Multi-Turn Instruction Following Benchmark | Elliot L. Epstein et al. | 2409.18216v1 | null |
| 2024-09-26 | MIO: A Foundation Model on Multimodal Tokens | Zekun Wang et al. | 2409.17692v1 | null |
| 2024-09-26 | ZALM3: Zero-Shot Enhancement of Vision-Language Alignment via In-Context Information in Multi-Turn Multimodal Medical Dialogue | Zhangpu Li et al. | 2409.17610v1 | null |
| 2024-09-24 | M$^2$PT: Multimodal Prompt Tuning for Zero-shot Instruction Learning | Taowen Wang et al. | 2409.15657v3 | link |
| 2024-09-20 | MaPPER: Multimodal Prior-guided Parameter Efficient Tuning for Referring Expression Comprehension | Ting Liu et al. | 2409.13609v2 | null |
| 2024-09-20 | AVG-LLaVA: A Large Multimodal Model with Adaptive Visual Granularity | Zhibin Lan et al. | 2410.02745v2 | link |
| 2024-09-20 | ChemDFM-X: Towards Large Multimodal Model for Chemistry | Zihan Zhao et al. | 2409.13194v1 | null |
| 2024-09-18 | Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution | Peng Wang et al. | 2409.12191v2 | link |
| 2024-09-17 | CoCA: Regaining Safety-awareness of Multimodal Large Language Models with Constitutional Calibration | Jiahui Gao et al. | 2409.11365v2 | null |
| 2024-09-16 | Model-in-the-Loop (MILO): Accelerating Multimodal AI Data Annotation with LLMs | Yifan Wang et al. | 2409.10702v2 | null |
| 2024-09-16 | Quantile Regression for Distributional Reward Models in RLHF | Nicolai Dorka et al. | 2409.10164v1 | link |
| 2024-09-14 | Constructive Approach to Bidirectional Causation between Qualia Structure and Language Emergence | Tadahiro Taniguchi et al. | 2409.09413v1 | null |
| 2024-09-14 | IW-Bench: Evaluating Large Multimodal Models for Converting Image-to-Web | Hongcheng Guo et al. | 2409.18980v1 | null |
| 2024-09-14 | From Text to Multimodality: Exploring the Evolution and Impact of Large Language Models in Medical Practice | Qian Niu et al. | 2410.01812v2 | null |
| 2024-09-11 | What to align in multimodal contrastive learning? | Benoit Dufumier et al. | 2409.07402v1 | null |
| 2024-09-09 | MLLM-FL: Multimodal Large Language Model Assisted Federated Learning on Heterogeneous and Long-tailed Data | Jianyi Zhang et al. | 2409.06067v1 | null |
| 2024-09-05 | ChartMoE: Mixture of Expert Connector for Advanced Chart Understanding | Zhengzhuo Xu et al. | 2409.03277v1 | null |
| 2024-08-30 | MultiMath: Bridging Visual and Mathematical Reasoning for Large Language Models | Shuai Peng et al. | 2409.00147v1 | link |
| 2024-08-23 | The Ultimate Guide to Fine-Tuning LLMs from Basics to Breakthroughs: An Exhaustive Review of Technologies, Research, Best Practices, Applied Research Challenges and Opportunities | Venkatesh Balavadhani Parthasarathy et al. | 2408.13296v1 | null |
| 2024-08-23 | IAA: Inner-Adaptor Architecture Empowers Frozen Large Language Model with Multimodal Capabilities | Bin Wang et al. | 2408.12902v1 | link |
| 2024-08-19 | Personalizing Reinforcement Learning from Human Feedback with Variational Preference Learning | Sriyash Poddar et al. | 2408.10075v1 | null |
| 2024-08-16 | Math-PUMA: Progressive Upward Multimodal Alignment to Enhance Mathematical Reasoning | Wenwen Zhuang et al. | 2408.08640v2 | link |
| 2024-08-13 | CROME: Cross-Modal Adapters for Efficient Multimodal LLM | Sayna Ebrahimi et al. | 2408.06610v1 | null |
| 2024-08-11 | HateSieve: A Contrastive Learning Framework for Detecting and Segmenting Hateful Content in Multimodal Memes | Xuanyu Su et al. | 2408.05794v1 | null |
| 2024-08-11 | VQ-CTAP: Cross-Modal Fine-Grained Sequence Representation Learning for Speech Processing | Chunyu Qiang et al. | 2408.05758v1 | null |
| 2024-08-09 | VITA: Towards Open-Source Interactive Omni Multimodal LLM | Chaoyou Fu et al. | 2408.05211v2 | link |
| 2024-08-01 | Mitigating Multilingual Hallucination in Large Vision-Language Models | Xiaoye Qu et al. | 2408.00550v1 | link |
| 2024-07-31 | Cross-modality Information Check for Detecting Jailbreaking in Multimodal Large Language Models | Yue Xu et al. | 2407.21659v4 | link |
| 2024-07-29 | BRIDGE: Bridging Gaps in Image Captioning Evaluation with Stronger Visual Cues | Sara Sarto et al. | 2407.20341v1 | link |
| 2024-07-28 | LLAVADI: What Matters For Multimodal Large Language Models Distillation | Shilin Xu et al. | 2407.19409v1 | null |
| 2024-07-26 | Creating an Aligned Corpus of Sound and Text: The Multimodal Corpus of Shakespeare and Milton | Manex Agirrezabal et al. | 2407.18730v1 | null |
| 2024-07-26 | Every Part Matters: Integrity Verification of Scientific Figures Based on Multimodal Large Language Models | Xiang Shi et al. | 2407.18626v1 | link |

Computer Vision

OVD

| Publish Date | Title | Authors | PDF | Code |
|---|---|---|---|---|
| 2024-09-24 | HA-FGOVD: Highlighting Fine-grained Attributes via Explicit Linear Composition for Open-Vocabulary Object Detection | Yuqi Ma et al. | 2409.16136v1 | null |
| 2024-04-03 | ALOHa: A New Measure for Hallucination in Captioning Models | Suzanne Petryk et al. | 2404.02904v1 | null |
| 2024-03-21 | Scene-Graph ViT: End-to-End Open-Vocabulary Visual Relationship Detection | Tim Salzmann et al. | 2403.14270v2 | null |
| 2024-03-11 | Real-time Transformer-based Open-Vocabulary Detection with Efficient Fusion Head | Tiancheng Zhao et al. | 2403.06892v1 | link |
| 2023-08-25 | How to Evaluate the Generalization of Detection? A Benchmark for Comprehensive Open-Vocabulary Detection | Yiyang Yao et al. | 2308.13177v2 | link |
| 2023-05-11 | Region-Aware Pretraining for Open-Vocabulary Object Detection with Vision Transformers | Dahun Kim et al. | 2305.07011v4 | link |
| 2023-04-10 | Prompt Pre-Training with Twenty-Thousand Classes for Open-Vocabulary Visual Recognition | Shuhuai Ren et al. | 2304.04704v2 | link |
| 2023-03-29 | MaMMUT: A Simple Architecture for Joint Learning for MultiModal Tasks | Weicheng Kuo et al. | 2303.16839v3 | null |
| 2023-03-17 | Investigating the Role of Attribute Context in Vision-Language Models for Object Recognition and Detection | Kyle Buettner et al. | 2303.10093v2 | null |
| 2022-09-10 | OmDet: Large-scale vision-language multi-dataset pre-training with multimodal detection network | Tiancheng Zhao et al. | 2209.05946v2 | link |
| 2022-06-12 | GLIPv2: Unifying Localization and Vision-Language Understanding | Haotian Zhang et al. | 2206.05836v2 | link |

LMM

| Publish Date | Title | Authors | PDF | Code |
|---|---|---|---|---|
| 2024-10-18 | MiCEval: Unveiling Multimodal Chain of Thought's Quality via Image Description and Reasoning Steps | Xiongtao Zhou et al. | 2410.14668v1 | link |
| 2024-10-18 | Few-Shot Joint Multimodal Entity-Relation Extraction via Knowledge-Enhanced Cross-modal Prompt Model | Li Yuan et al. | 2410.14225v1 | null |
| 2024-10-18 | MultiChartQA: Benchmarking Vision-Language Models on Multi-Chart Problems | Zifeng Zhu et al. | 2410.14179v1 | null |
| 2024-10-18 | Utilizing Large Language Models for Event Deconstruction to Enhance Multimodal Aspect-Based Sentiment Analysis | Xiaoyong Huang et al. | 2410.14150v1 | null |
| 2024-10-18 | Coherence-Driven Multimodal Safety Dialogue with Active Learning for Embodied Agents | Sabit Hassan et al. | 2410.14141v1 | null |
| 2024-10-17 | Generating Signed Language Instructions in Large-Scale Dialogue Systems | Mert İnan et al. | 2410.14026v1 | null |
| 2024-10-17 | Can MLLMs Understand the Deep Implication Behind Chinese Images? | Chenhao Zhang et al. | 2410.13854v1 | link |
| 2024-10-17 | Retrospective Learning from Interactions | Zizhao Chen et al. | 2410.13852v1 | null |
| 2024-10-17 | Harnessing Webpage UIs for Text-Rich Visual Understanding | Junpeng Liu et al. | 2410.13824v2 | null |
| 2024-10-17 | MobA: A Two-Level Agent System for Efficient Mobile Task Automation | Zichen Zhu et al. | 2410.13757v1 | null |
| 2024-10-17 | Exploring the Design Space of Visual Context Representation in Video MLLMs | Yifan Du et al. | 2410.13694v1 | link |
| 2024-10-17 | Remember, Retrieve and Generate: Understanding Infinite Visual Concepts as Your Personalized Assistant | Haoran Hao et al. | 2410.13360v1 | link |
| 2024-10-17 | Representation Learning of Structured Data for Medical Foundation Models | Vijay Prakash Dwivedi et al. | 2410.13351v1 | null |
| 2024-10-17 | CLaMP 2: Multimodal Music Information Retrieval Across 101 Languages Using Large Language Models | Shangda Wu et al. | 2410.13267v1 | link |
| 2024-10-16 | MMed-RAG: Versatile Multimodal RAG System for Medical Vision Language Models | Peng Xia et al. | 2410.13085v1 | link |
| 2024-10-16 | WorldMedQA-V: a multilingual, multimodal medical examination dataset for multimodal language models evaluation | João Matos et al. | 2410.12722v1 | link |
| 2024-10-16 | Prompt Compression for Large Language Models: A Survey | Zongqian Li et al. | 2410.12388v2 | link |
| 2024-10-16 | Understanding the Role of LLMs in Multimodal Evaluation Benchmarks | Botian Jiang et al. | 2410.12329v1 | null |
| 2024-10-15 | OMCAT: Omni Context Aware Transformer | Arushi Goel et al. | 2410.12109v1 | null |
| 2024-10-15 | MLLM can see? Dynamic Correction Decoding for Hallucination Mitigation | Chenxi Wang et al. | 2410.11779v1 | link |
| 2024-10-15 | Magnifier Prompt: Tackling Multimodal Hallucination via Extremely Simple Instructions | Yuhan Fu et al. | 2410.11701v1 | null |
| 2024-10-15 | VidEgoThink: Assessing Egocentric Video Understanding Capabilities for Embodied AI | Sijie Cheng et al. | 2410.11623v1 | null |
| 2024-10-15 | MultiVENT 2.0: A Massive Multilingual Benchmark for Event-Centric Video Retrieval | Reno Kriz et al. | 2410.11619v1 | null |
| 2024-10-15 | Difficult Task Yes but Simple Task No: Unveiling the Laziness in Multimodal LLMs | Sihang Zhao et al. | 2410.11437v1 | link |
| 2024-10-14 | Generative AI and Its Impact on Personalized Intelligent Tutoring Systems | Subhankar Maity et al. | 2410.10650v1 | null |
| 2024-10-14 | MMIE: Massive Multimodal Interleaved Comprehension Benchmark for Large Vision-Language Models | Peng Xia et al. | 2410.10139v1 | link |
| 2024-10-13 | Empowering Dysarthric Speech: Leveraging Advanced LLMs for Accurate Speech Correction and Multimodal Emotion Analysis | Kaushal Attaluri et al. | 2410.12867v1 | null |
| 2024-10-13 | BlackDAN: A Black-Box Multi-Objective Approach for Effective and Contextual Jailbreaking of Large Language Models | Xinyuan Wang et al. | 2410.09804v2 | null |
| 2024-10-13 | ECIS-VQG: Generation of Entity-centric Information-seeking Questions from Videos | Arpan Phukan et al. | 2410.09776v1 | link |
| 2024-10-12 | Reconstructive Visual Instruction Tuning | Haochen Wang et al. | 2410.09575v1 | null |
| 2024-10-12 | Declarative Knowledge Distillation from Large Language Models for Visual Question Answering Datasets | Thomas Eiter et al. | 2410.09428v1 | link |
| 2024-10-11 | M3Hop-CoT: Misogynous Meme Identification with Multimodal Multi-hop Chain-of-Thought | Gitanjali Kumari et al. | 2410.09220v1 | link |
| 2024-10-11 | A Social Context-aware Graph-based Multimodal Attentive Learning Framework for Disaster Content Classification during Emergencies | Shahid Shafi Dar et al. | 2410.08814v1 | null |
| 2024-10-11 | Baichuan-Omni Technical Report | Yadong Li et al. | 2410.08565v1 | link |
| 2024-10-11 | SPORTU: A Comprehensive Sports Understanding Benchmark for Multimodal Large Language Models | Haotian Xia et al. | 2410.08474v2 | null |
| 2024-10-10 | LatteCLIP: Unsupervised CLIP Fine-Tuning via LMM-Synthetic Texts | Anh-Quan Cao et al. | 2410.08211v1 | null |
| 2024-10-10 | Mono-InternVL: Pushing the Boundaries of Monolithic Multimodal Large Language Models with Endogenous Visual Pre-training | Gen Luo et al. | 2410.08202v1 | null |
| 2024-10-10 | MRAG-Bench: Vision-Centric Evaluation for Retrieval-Augmented Multimodal Models | Wenbo Hu et al. | 2410.08182v1 | null |
| 2024-10-10 | Sample then Identify: A General Framework for Risk Control and Assessment in Multimodal Large Language Models | Qingni Wang et al. | 2410.08174v1 | null |
| 2024-10-10 | Agent S: An Open Agentic Framework that Uses Computers Like a Human | Saaket Agashe et al. | 2410.08164v1 | link |
| 2024-10-10 | Insight Over Sight? Exploring the Vision-Knowledge Conflicts in Multimodal LLMs | Xiaoyuan Liu et al. | 2410.08145v1 | null |
| 2024-10-10 | InstructBioMol: Advancing Biomolecule Understanding and Design Following Human Instructions | Xiang Zhuang et al. | 2410.07919v1 | null |
| 2024-10-10 | How Does Vision-Language Adaptation Impact the Safety of Vision Language Models? | Seongyun Lee et al. | 2410.07571v1 | null |
| 2024-10-10 | Thought2Text: Text Generation from EEG Signal using Large Language Models (LLMs) | Abhijit Mishra et al. | 2410.07507v1 | link |
| 2024-10-09 | Do better language models have crisper vision? | Jona Ruthardt et al. | 2410.07173v1 | null |
| 2024-10-09 | To Preserve or To Compress: An In-Depth Study of Connector Selection in Multimodal Large Language Models | Junyan Lin et al. | 2410.06765v1 | link |
| 2024-10-09 | Enhancing Multimodal LLM for Detailed and Accurate Video Captioning using Multi-Round Preference Optimization | Changli Tang et al. | 2410.06682v2 | null |
| 2024-10-09 | ING-VP: MLLMs cannot Play Easy Vision-based Games Yet | Haoran Zhang et al. | 2410.06555v1 | link |
| 2024-10-09 | Chip-Tuning: Classify Before Language Models Say | Fangwei Zhu et al. | 2410.06541v2 | link |
| 2024-10-08 | Multimodal Situational Safety | Kaiwen Zhou et al. | 2410.06172v1 | null |
| 2024-10-08 | PDF-WuKong: A Large Multimodal Model for Efficient Long PDF Reading with End-to-End Sparse Sampling | Xudong Xie et al. | 2410.05970v1 | link |
| 2024-10-08 | LLaCA: Multimodal Large Language Continual Assistant | Jingyang Qiao et al. | 2410.10868v1 | null |
| 2024-10-08 | Multimodal Large Language Models and Tunings: Vision, Language, Sensors, Audio, and Beyond | Soyeon Caren Han et al. | 2410.05608v1 | null |
| 2024-10-07 | Navigating the Digital World as Humans Do: Universal Visual Grounding for GUI Agents | Boyu Gou et al. | 2410.05243v1 | link |
| 2024-10-07 | MINER: Mining the Underlying Pattern of Modality-Specific Neurons in Multimodal Large Language Models | Kaichen Huang et al. | 2410.04819v1 | link |
| 2024-10-07 | TLDR: Token-Level Detective Reward Model for Large Vision Language Models | Deqing Fu et al. | 2410.04734v1 | null |
| 2024-10-06 | LRQ-Fact: LLM-Generated Relevant Questions for Multimodal Fact-Checking | Alimohammad Beigi et al. | 2410.04616v1 | null |
| 2024-10-06 | CogDevelop2K: Reversed Cognitive Development in Multimodal Large Language Models | Yijiang Li et al. | 2410.10855v1 | null |
| 2024-10-06 | FAMMA: A Benchmark for Financial Domain Multilingual Multimodal Question Answering | Siqiao Xue et al. | 2410.04526v2 | null |
| 2024-10-06 | ErrorRadar: Benchmarking Complex Mathematical Reasoning of Multimodal Large Language Models Via Error Detection | Yibo Yan et al. | 2410.04509v2 | null |
| 2024-10-06 | Fine-Grained Prediction of Reading Comprehension from Eye Movements | Omer Shubi et al. | 2410.04484v1 | link |
| 2024-10-04 | Unraveling Cross-Modality Knowledge Conflicts in Large Vision-Language Models | Tinghui Zhu et al. | 2410.03659v2 | link |
| 2024-10-04 | Self-Powered LLM Modality Expansion for Large Speech-Text Models | Tengfei Yu et al. | 2410.03798v2 | link |
| 2024-10-03 | Vinoground: Scrutinizing LMMs over Dense Temporal Reasoning with Short Videos | Jianrui Zhang et al. | 2410.02763v1 | null |
| 2024-10-03 | Video Instruction Tuning With Synthetic Data | Yuanhan Zhang et al. | 2410.02713v2 | null |
| 2024-10-03 | LLaVA-Critic: Learning to Evaluate Multimodal Models | Tianyi Xiong et al. | 2410.02712v1 | null |
| 2024-10-03 | From Pixels to Tokens: Byte-Pair Encoding on Quantized Visual Modalities | Wanpeng Zhang et al. | 2410.02155v2 | null |
| 2024-10-02 | Leopard: A Vision Language Model For Text-Rich Multi-Image Tasks | Mengzhao Jia et al. | 2410.01744v2 | link |
| 2024-10-01 | BabelBench: An Omni Benchmark for Code-Driven Analysis of Multimodal and Multistructured Data | Xuwu Wang et al. | 2410.00773v1 | link |
| 2024-10-01 | ERASMO: Leveraging Large Language Models for Enhanced Clustering Segmentation | Fillipe dos Santos Silva et al. | 2410.03738v1 | null |
| 2024-09-30 | Robin3D: Improving 3D Large Language Model via Robust Instruction Tuning | Weitai Kang et al. | 2410.00255v1 | link |
| 2024-09-30 | MM1.5: Methods, Analysis & Insights from Multimodal LLM Fine-tuning | Haotian Zhang et al. | 2409.20566v1 | null |
| 2024-09-30 | HELPD: Mitigating Hallucination of LVLMs by Hierarchical Feedback Learning with Vision-enhanced Penalty Decoding | Fan Yuan et al. | 2409.20429v1 | null |
| 2024-09-30 | Using Large Multimodal Models to Extract Knowledge Components for Knowledge Tracing from Multimedia Question Information | Hyeongdon Moon et al. | 2409.20167v1 | link |
| 2024-09-30 | Multimodal LLM Enhanced Cross-lingual Cross-modal Retrieval | Yabing Wang et al. | 2409.19961v1 | null |