Automated deployment @ 2024-10-22 09:44:43
Add your topics and keywords in database/topic.yml.
You can also view historical data through database/storage.
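
The rows in the tables below follow a fixed five-cell layout: publish date, title, authors, a versioned arXiv identifier, and a code flag that is either `link` or `null`. As a minimal sketch for reading these rows programmatically, assuming that layout holds and that titles contain no `|` character, something like the following could be used. The names `PaperRow` and `parse_row` are hypothetical and are not part of this repository.

```python
import re
from typing import NamedTuple, Optional


class PaperRow(NamedTuple):
    """One table row (hypothetical structure, not defined by the repository)."""
    date: str
    title: str
    authors: str
    arxiv_id: str
    has_code: bool


# Rows start with an ISO-style date such as 2024-10-18.
_DATE = re.compile(r"^\d{4}-\d{2}-\d{2}$")


def parse_row(line: str) -> Optional[PaperRow]:
    """Parse one pipe-separated row; return None for headers and separators."""
    # Drop outer whitespace and outer pipes, then split into trimmed cells.
    cells = [c.strip() for c in line.strip().strip("|").split("|")]
    # Assumption: exactly five cells, the first of which is a date.
    if len(cells) != 5 or not _DATE.match(cells[0]):
        return None
    date, title, authors, arxiv_id, code = cells
    # The last cell is "link" when a code link exists and "null" otherwise.
    return PaperRow(date, title, authors, arxiv_id, code == "link")


if __name__ == "__main__":
    sample = (
        "2024-01-19 | Weakly Supervised Gaussian Contrastive Grounding with "
        "Large Multimodal Models for Video Question Answering | "
        "Haibo Wang et.al. | 2401.10711v4 | link |"
    )
    print(parse_row(sample))
```

Header and separator lines return `None`, so a whole table can be passed through `parse_row` line by line and filtered.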
Publish Date | Title | Authors | arXiv ID | Code |
---|---|---|---|---|
2024-02-29 | How to Understand "Support"? An Implicit-enhanced Causal Inference Approach for Weakly-supervised Phrase Grounding | Jiamin Luo et.al. | 2402.19116v2 | null |
2024-01-19 | Weakly Supervised Gaussian Contrastive Grounding with Large Multimodal Models for Video Question Answering | Haibo Wang et.al. | 2401.10711v4 | link |
2023-12-15 | Weakly-Supervised 3D Visual Grounding based on Visual Linguistic Alignment | Xiaoxu Xu et.al. | 2312.09625v3 | null |
2023-12-07 | Improved Visual Grounding through Self-Consistent Explanations | Ruozhen He et.al. | 2312.04554v1 | null |
2023-05-18 | Weakly-Supervised Visual-Textual Grounding with Semantic Prior Refinement | Davide Rigoni et.al. | 2305.10913v2 | link |
2023-03-31 | Zero-shot Referring Image Segmentation with Global-Local Context Features | Seonghoon Yu et.al. | 2303.17811v2 | link |
2022-10-09 | MAMO: Masked Multimodal Modeling for Fine-Grained Vision-Language Representation Learning | Zijia Zhao et.al. | 2210.04183v3 | null |
2022-06-14 | Beyond Grounding: Extracting Fine-Grained Event Hierarchies Across Modalities | Hammad A. Ayyubi et.al. | 2206.07207v3 | null |
2022-04-22 | Hypergraph Transformer: Weakly-supervised Multi-hop Reasoning for Knowledge-based Visual Question Answering | Yu-Jung Heo et.al. | 2204.10448v1 | link |
2022-03-16 | Pseudo-Q: Generating Pseudo Language Queries for Visual Grounding | Haojun Jiang et.al. | 2203.08481v2 | link |
2022-02-09 | Can Open Domain Question Answering Systems Answer Visual Knowledge Questions? | Jiawen Zhang et.al. | 2202.04306v1 | null |
2021-12-01 | Weakly-Supervised Video Object Grounding via Causal Intervention | Wei Wang et.al. | 2112.00475v1 | null |
2021-09-04 | Weakly Supervised Relative Spatial Reasoning for Visual Question Answering | Pratyay Banerjee et.al. | 2109.01934v1 | null |
2020-10-12 | MAF: Multimodal Alignment Framework for Weakly-Supervised Phrase Grounding | Qinxin Wang et.al. | 2010.05379v1 | link |
2020-06-17 | Contrastive Learning for Weakly Supervised Phrase Grounding | Tanmay Gupta et.al. | 2006.09920v3 | link |
2019-12-01 | Learning to Relate from Captions and Bounding Boxes | Sarthak Garg et.al. | 1912.00311v1 | null |
2019-08-29 | Aesthetic Image Captioning From Weakly-Labelled Photographs | Koustav Ghosal et.al. | 1908.11310v1 | null |
Publish Date | Title | Authors | arXiv ID | Code |
---|---|---|---|---|
2024-10-18 | MiCEval: Unveiling Multimodal Chain of Thought's Quality via Image Description and Reasoning Steps | Xiongtao Zhou et.al. | 2410.14668v1 | link |
2024-10-18 | Few-Shot Joint Multimodal Entity-Relation Extraction via Knowledge-Enhanced Cross-modal Prompt Model | Li Yuan et.al. | 2410.14225v1 | null |
2024-10-17 | Exploring the Design Space of Visual Context Representation in Video MLLMs | Yifan Du et.al. | 2410.13694v1 | link |
2024-10-17 | Remember, Retrieve and Generate: Understanding Infinite Visual Concepts as Your Personalized Assistant | Haoran Hao et.al. | 2410.13360v1 | link |
2024-10-17 | CLaMP 2: Multimodal Music Information Retrieval Across 101 Languages Using Large Language Models | Shangda Wu et.al. | 2410.13267v1 | link |
2024-10-16 | MMed-RAG: Versatile Multimodal RAG System for Medical Vision Language Models | Peng Xia et.al. | 2410.13085v1 | link |
2024-10-15 | OMCAT: Omni Context Aware Transformer | Arushi Goel et.al. | 2410.12109v1 | null |
2024-10-14 | MMCFND: Multimodal Multilingual Caption-aware Fake News Detection for Low-resource Indic Languages | Shubhi Bansal et.al. | 2410.10407v1 | null |
2024-10-11 | Baichuan-Omni Technical Report | Yadong Li et.al. | 2410.08565v1 | link |
2024-10-10 | InstructBioMol: Advancing Biomolecule Understanding and Design Following Human Instructions | Xiang Zhuang et.al. | 2410.07919v1 | null |
2024-10-09 | Do better language models have crisper vision? | Jona Ruthardt et.al. | 2410.07173v1 | null |
2024-10-09 | ETA: Evaluating Then Aligning Safety of Vision Language Models at Inference Time | Yi Ding et.al. | 2410.06625v1 | link |
2024-10-08 | LLaCA: Multimodal Large Language Continual Assistant | Jingyang Qiao et.al. | 2410.10868v1 | null |
2024-10-08 | Multimodal Large Language Models and Tunings: Vision, Language, Sensors, Audio, and Beyond | Soyeon Caren Han et.al. | 2410.05608v1 | null |
2024-10-03 | LLaVA-Critic: Learning to Evaluate Multimodal Models | Tianyi Xiong et.al. | 2410.02712v1 | null |
2024-10-03 | From Pixels to Tokens: Byte-Pair Encoding on Quantized Visual Modalities | Wanpeng Zhang et.al. | 2410.02155v2 | null |
2024-09-30 | The age of spiritual machines: Language quietus induces synthetic altered states of consciousness in artificial intelligence | Jeremy I Skipper et.al. | 2410.00257v1 | null |
2024-09-30 | Multimodal LLM Enhanced Cross-lingual Cross-modal Retrieval | Yabing Wang et.al. | 2409.19961v1 | null |
2024-09-29 | A multimodal LLM for the non-invasive decoding of spoken text from brain recordings | Youssef Hmamouche et.al. | 2409.19710v1 | null |
2024-09-27 | Show and Guide: Instructional-Plan Grounded Vision and Language Model | Diogo Glória-Silva et.al. | 2409.19074v3 | link |
2024-09-27 | CLLMate: A Multimodal LLM for Weather and Climate Events Forecasting | Haobo Li et.al. | 2409.19058v1 | null |
2024-09-26 | MMMT-IF: A Challenging Multimodal Multi-Turn Instruction Following Benchmark | Elliot L. Epstein et.al. | 2409.18216v1 | null |
2024-09-26 | MIO: A Foundation Model on Multimodal Tokens | Zekun Wang et.al. | 2409.17692v1 | null |
2024-09-26 | ZALM3: Zero-Shot Enhancement of Vision-Language Alignment via In-Context Information in Multi-Turn Multimodal Medical Dialogue | Zhangpu Li et.al. | 2409.17610v1 | null |
2024-09-24 | M$^2$PT: Multimodal Prompt Tuning for Zero-shot Instruction Learning | Taowen Wang et.al. | 2409.15657v3 | link |
2024-09-20 | MaPPER: Multimodal Prior-guided Parameter Efficient Tuning for Referring Expression Comprehension | Ting Liu et.al. | 2409.13609v2 | null |
2024-09-20 | AVG-LLaVA: A Large Multimodal Model with Adaptive Visual Granularity | Zhibin Lan et.al. | 2410.02745v2 | link |
2024-09-20 | ChemDFM-X: Towards Large Multimodal Model for Chemistry | Zihan Zhao et.al. | 2409.13194v1 | null |
2024-09-18 | Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution | Peng Wang et.al. | 2409.12191v2 | link |
2024-09-17 | CoCA: Regaining Safety-awareness of Multimodal Large Language Models with Constitutional Calibration | Jiahui Gao et.al. | 2409.11365v2 | null |
2024-09-16 | Model-in-the-Loop (MILO): Accelerating Multimodal AI Data Annotation with LLMs | Yifan Wang et.al. | 2409.10702v2 | null |
2024-09-16 | Quantile Regression for Distributional Reward Models in RLHF | Nicolai Dorka et.al. | 2409.10164v1 | link |
2024-09-14 | Constructive Approach to Bidirectional Causation between Qualia Structure and Language Emergence | Tadahiro Taniguchi et.al. | 2409.09413v1 | null |
2024-09-14 | IW-Bench: Evaluating Large Multimodal Models for Converting Image-to-Web | Hongcheng Guo et.al. | 2409.18980v1 | null |
2024-09-14 | From Text to Multimodality: Exploring the Evolution and Impact of Large Language Models in Medical Practice | Qian Niu et.al. | 2410.01812v2 | null |
2024-09-11 | What to align in multimodal contrastive learning? | Benoit Dufumier et.al. | 2409.07402v1 | null |
2024-09-09 | MLLM-FL: Multimodal Large Language Model Assisted Federated Learning on Heterogeneous and Long-tailed Data | Jianyi Zhang et.al. | 2409.06067v1 | null |
2024-09-05 | ChartMoE: Mixture of Expert Connector for Advanced Chart Understanding | Zhengzhuo Xu et.al. | 2409.03277v1 | null |
2024-08-30 | MultiMath: Bridging Visual and Mathematical Reasoning for Large Language Models | Shuai Peng et.al. | 2409.00147v1 | link |
2024-08-23 | The Ultimate Guide to Fine-Tuning LLMs from Basics to Breakthroughs: An Exhaustive Review of Technologies, Research, Best Practices, Applied Research Challenges and Opportunities | Venkatesh Balavadhani Parthasarathy et.al. | 2408.13296v1 | null |
2024-08-23 | IAA: Inner-Adaptor Architecture Empowers Frozen Large Language Model with Multimodal Capabilities | Bin Wang et.al. | 2408.12902v1 | link |
2024-08-19 | Personalizing Reinforcement Learning from Human Feedback with Variational Preference Learning | Sriyash Poddar et.al. | 2408.10075v1 | null |
2024-08-16 | Math-PUMA: Progressive Upward Multimodal Alignment to Enhance Mathematical Reasoning | Wenwen Zhuang et.al. | 2408.08640v2 | link |
2024-08-13 | CROME: Cross-Modal Adapters for Efficient Multimodal LLM | Sayna Ebrahimi et.al. | 2408.06610v1 | null |
2024-08-11 | HateSieve: A Contrastive Learning Framework for Detecting and Segmenting Hateful Content in Multimodal Memes | Xuanyu Su et.al. | 2408.05794v1 | null |
2024-08-11 | VQ-CTAP: Cross-Modal Fine-Grained Sequence Representation Learning for Speech Processing | Chunyu Qiang et.al. | 2408.05758v1 | null |
2024-08-09 | VITA: Towards Open-Source Interactive Omni Multimodal LLM | Chaoyou Fu et.al. | 2408.05211v2 | link |
2024-08-01 | Mitigating Multilingual Hallucination in Large Vision-Language Models | Xiaoye Qu et.al. | 2408.00550v1 | link |
2024-07-31 | Cross-modality Information Check for Detecting Jailbreaking in Multimodal Large Language Models | Yue Xu et.al. | 2407.21659v4 | link |
2024-07-29 | BRIDGE: Bridging Gaps in Image Captioning Evaluation with Stronger Visual Cues | Sara Sarto et.al. | 2407.20341v1 | link |
2024-07-28 | LLAVADI: What Matters For Multimodal Large Language Models Distillation | Shilin Xu et.al. | 2407.19409v1 | null |
2024-07-26 | Creating an Aligned Corpus of Sound and Text: The Multimodal Corpus of Shakespeare and Milton | Manex Agirrezabal et.al. | 2407.18730v1 | null |
2024-07-26 | Every Part Matters: Integrity Verification of Scientific Figures Based on Multimodal Large Language Models | Xiang Shi et.al. | 2407.18626v1 | link |
Publish Date | Title | Authors | arXiv ID | Code |
---|---|---|---|---|
2024-09-24 | HA-FGOVD: Highlighting Fine-grained Attributes via Explicit Linear Composition for Open-Vocabulary Object Detection | Yuqi Ma et.al. | 2409.16136v1 | null |
2024-04-03 | ALOHa: A New Measure for Hallucination in Captioning Models | Suzanne Petryk et.al. | 2404.02904v1 | null |
2024-03-21 | Scene-Graph ViT: End-to-End Open-Vocabulary Visual Relationship Detection | Tim Salzmann et.al. | 2403.14270v2 | null |
2024-03-11 | Real-time Transformer-based Open-Vocabulary Detection with Efficient Fusion Head | Tiancheng Zhao et.al. | 2403.06892v1 | link |
2023-08-25 | How to Evaluate the Generalization of Detection? A Benchmark for Comprehensive Open-Vocabulary Detection | Yiyang Yao et.al. | 2308.13177v2 | link |
2023-05-11 | Region-Aware Pretraining for Open-Vocabulary Object Detection with Vision Transformers | Dahun Kim et.al. | 2305.07011v4 | link |
2023-04-10 | Prompt Pre-Training with Twenty-Thousand Classes for Open-Vocabulary Visual Recognition | Shuhuai Ren et.al. | 2304.04704v2 | link |
2023-03-29 | MaMMUT: A Simple Architecture for Joint Learning for MultiModal Tasks | Weicheng Kuo et.al. | 2303.16839v3 | null |
2023-03-17 | Investigating the Role of Attribute Context in Vision-Language Models for Object Recognition and Detection | Kyle Buettner et.al. | 2303.10093v2 | null |
2022-09-10 | OmDet: Large-scale vision-language multi-dataset pre-training with multimodal detection network | Tiancheng Zhao et.al. | 2209.05946v2 | link |
2022-06-12 | GLIPv2: Unifying Localization and Vision-Language Understanding | Haotian Zhang et.al. | 2206.05836v2 | link |
Publish Date | Title | Authors | arXiv ID | Code |
---|---|---|---|---|
2024-10-18 | MiCEval: Unveiling Multimodal Chain of Thought's Quality via Image Description and Reasoning Steps | Xiongtao Zhou et.al. | 2410.14668v1 | link |
2024-10-18 | Few-Shot Joint Multimodal Entity-Relation Extraction via Knowledge-Enhanced Cross-modal Prompt Model | Li Yuan et.al. | 2410.14225v1 | null |
2024-10-18 | MultiChartQA: Benchmarking Vision-Language Models on Multi-Chart Problems | Zifeng Zhu et.al. | 2410.14179v1 | null |
2024-10-18 | Utilizing Large Language Models for Event Deconstruction to Enhance Multimodal Aspect-Based Sentiment Analysis | Xiaoyong Huang et.al. | 2410.14150v1 | null |
2024-10-18 | Coherence-Driven Multimodal Safety Dialogue with Active Learning for Embodied Agents | Sabit Hassan et.al. | 2410.14141v1 | null |
2024-10-17 | Generating Signed Language Instructions in Large-Scale Dialogue Systems | Mert İnan et.al. | 2410.14026v1 | null |
2024-10-17 | Can MLLMs Understand the Deep Implication Behind Chinese Images? | Chenhao Zhang et.al. | 2410.13854v1 | link |
2024-10-17 | Retrospective Learning from Interactions | Zizhao Chen et.al. | 2410.13852v1 | null |
2024-10-17 | Harnessing Webpage UIs for Text-Rich Visual Understanding | Junpeng Liu et.al. | 2410.13824v2 | null |
2024-10-17 | MobA: A Two-Level Agent System for Efficient Mobile Task Automation | Zichen Zhu et.al. | 2410.13757v1 | null |
2024-10-17 | Exploring the Design Space of Visual Context Representation in Video MLLMs | Yifan Du et.al. | 2410.13694v1 | link |
2024-10-17 | Remember, Retrieve and Generate: Understanding Infinite Visual Concepts as Your Personalized Assistant | Haoran Hao et.al. | 2410.13360v1 | link |
2024-10-17 | Representation Learning of Structured Data for Medical Foundation Models | Vijay Prakash Dwivedi et.al. | 2410.13351v1 | null |
2024-10-17 | CLaMP 2: Multimodal Music Information Retrieval Across 101 Languages Using Large Language Models | Shangda Wu et.al. | 2410.13267v1 | link |
2024-10-16 | MMed-RAG: Versatile Multimodal RAG System for Medical Vision Language Models | Peng Xia et.al. | 2410.13085v1 | link |
2024-10-16 | WorldMedQA-V: a multilingual, multimodal medical examination dataset for multimodal language models evaluation | João Matos et.al. | 2410.12722v1 | link |
2024-10-16 | Prompt Compression for Large Language Models: A Survey | Zongqian Li et.al. | 2410.12388v2 | link |
2024-10-16 | Understanding the Role of LLMs in Multimodal Evaluation Benchmarks | Botian Jiang et.al. | 2410.12329v1 | null |
2024-10-15 | OMCAT: Omni Context Aware Transformer | Arushi Goel et.al. | 2410.12109v1 | null |
2024-10-15 | MLLM can see? Dynamic Correction Decoding for Hallucination Mitigation | Chenxi Wang et.al. | 2410.11779v1 | link |
2024-10-15 | Magnifier Prompt: Tackling Multimodal Hallucination via Extremely Simple Instructions | Yuhan Fu et.al. | 2410.11701v1 | null |
2024-10-15 | VidEgoThink: Assessing Egocentric Video Understanding Capabilities for Embodied AI | Sijie Cheng et.al. | 2410.11623v1 | null |
2024-10-15 | MultiVENT 2.0: A Massive Multilingual Benchmark for Event-Centric Video Retrieval | Reno Kriz et.al. | 2410.11619v1 | null |
2024-10-15 | Difficult Task Yes but Simple Task No: Unveiling the Laziness in Multimodal LLMs | Sihang Zhao et.al. | 2410.11437v1 | link |
2024-10-14 | Generative AI and Its Impact on Personalized Intelligent Tutoring Systems | Subhankar Maity et.al. | 2410.10650v1 | null |
2024-10-14 | MMIE: Massive Multimodal Interleaved Comprehension Benchmark for Large Vision-Language Models | Peng Xia et.al. | 2410.10139v1 | link |
2024-10-13 | Empowering Dysarthric Speech: Leveraging Advanced LLMs for Accurate Speech Correction and Multimodal Emotion Analysis | Kaushal Attaluri et.al. | 2410.12867v1 | null |
2024-10-13 | BlackDAN: A Black-Box Multi-Objective Approach for Effective and Contextual Jailbreaking of Large Language Models | Xinyuan Wang et.al. | 2410.09804v2 | null |
2024-10-13 | ECIS-VQG: Generation of Entity-centric Information-seeking Questions from Videos | Arpan Phukan et.al. | 2410.09776v1 | link |
2024-10-12 | Reconstructive Visual Instruction Tuning | Haochen Wang et.al. | 2410.09575v1 | null |
2024-10-12 | Declarative Knowledge Distillation from Large Language Models for Visual Question Answering Datasets | Thomas Eiter et.al. | 2410.09428v1 | link |
2024-10-11 | M3Hop-CoT: Misogynous Meme Identification with Multimodal Multi-hop Chain-of-Thought | Gitanjali Kumari et.al. | 2410.09220v1 | link |
2024-10-11 | A Social Context-aware Graph-based Multimodal Attentive Learning Framework for Disaster Content Classification during Emergencies | Shahid Shafi Dar et.al. | 2410.08814v1 | null |
2024-10-11 | Baichuan-Omni Technical Report | Yadong Li et.al. | 2410.08565v1 | link |
2024-10-11 | SPORTU: A Comprehensive Sports Understanding Benchmark for Multimodal Large Language Models | Haotian Xia et.al. | 2410.08474v2 | null |
2024-10-10 | LatteCLIP: Unsupervised CLIP Fine-Tuning via LMM-Synthetic Texts | Anh-Quan Cao et.al. | 2410.08211v1 | null |
2024-10-10 | Mono-InternVL: Pushing the Boundaries of Monolithic Multimodal Large Language Models with Endogenous Visual Pre-training | Gen Luo et.al. | 2410.08202v1 | null |
2024-10-10 | MRAG-Bench: Vision-Centric Evaluation for Retrieval-Augmented Multimodal Models | Wenbo Hu et.al. | 2410.08182v1 | null |
2024-10-10 | Sample then Identify: A General Framework for Risk Control and Assessment in Multimodal Large Language Models | Qingni Wang et.al. | 2410.08174v1 | null |
2024-10-10 | Agent S: An Open Agentic Framework that Uses Computers Like a Human | Saaket Agashe et.al. | 2410.08164v1 | link |
2024-10-10 | Insight Over Sight? Exploring the Vision-Knowledge Conflicts in Multimodal LLMs | Xiaoyuan Liu et.al. | 2410.08145v1 | null |
2024-10-10 | InstructBioMol: Advancing Biomolecule Understanding and Design Following Human Instructions | Xiang Zhuang et.al. | 2410.07919v1 | null |
2024-10-10 | How Does Vision-Language Adaptation Impact the Safety of Vision Language Models? | Seongyun Lee et.al. | 2410.07571v1 | null |
2024-10-10 | Thought2Text: Text Generation from EEG Signal using Large Language Models (LLMs) | Abhijit Mishra et.al. | 2410.07507v1 | link |
2024-10-09 | Do better language models have crisper vision? | Jona Ruthardt et.al. | 2410.07173v1 | null |
2024-10-09 | To Preserve or To Compress: An In-Depth Study of Connector Selection in Multimodal Large Language Models | Junyan Lin et.al. | 2410.06765v1 | link |
2024-10-09 | Enhancing Multimodal LLM for Detailed and Accurate Video Captioning using Multi-Round Preference Optimization | Changli Tang et.al. | 2410.06682v2 | null |
2024-10-09 | ING-VP: MLLMs cannot Play Easy Vision-based Games Yet | Haoran Zhang et.al. | 2410.06555v1 | link |
2024-10-09 | Chip-Tuning: Classify Before Language Models Say | Fangwei Zhu et.al. | 2410.06541v2 | link |
2024-10-08 | Multimodal Situational Safety | Kaiwen Zhou et.al. | 2410.06172v1 | null |
2024-10-08 | PDF-WuKong: A Large Multimodal Model for Efficient Long PDF Reading with End-to-End Sparse Sampling | Xudong Xie et.al. | 2410.05970v1 | link |
2024-10-08 | LLaCA: Multimodal Large Language Continual Assistant | Jingyang Qiao et.al. | 2410.10868v1 | null |
2024-10-08 | Multimodal Large Language Models and Tunings: Vision, Language, Sensors, Audio, and Beyond | Soyeon Caren Han et.al. | 2410.05608v1 | null |
2024-10-07 | Navigating the Digital World as Humans Do: Universal Visual Grounding for GUI Agents | Boyu Gou et.al. | 2410.05243v1 | link |
2024-10-07 | MINER: Mining the Underlying Pattern of Modality-Specific Neurons in Multimodal Large Language Models | Kaichen Huang et.al. | 2410.04819v1 | link |
2024-10-07 | TLDR: Token-Level Detective Reward Model for Large Vision Language Models | Deqing Fu et.al. | 2410.04734v1 | null |
2024-10-06 | LRQ-Fact: LLM-Generated Relevant Questions for Multimodal Fact-Checking | Alimohammad Beigi et.al. | 2410.04616v1 | null |
2024-10-06 | CogDevelop2K: Reversed Cognitive Development in Multimodal Large Language Models | Yijiang Li et.al. | 2410.10855v1 | null |
2024-10-06 | FAMMA: A Benchmark for Financial Domain Multilingual Multimodal Question Answering | Siqiao Xue et.al. | 2410.04526v2 | null |
2024-10-06 | ErrorRadar: Benchmarking Complex Mathematical Reasoning of Multimodal Large Language Models Via Error Detection | Yibo Yan et.al. | 2410.04509v2 | null |
2024-10-06 | Fine-Grained Prediction of Reading Comprehension from Eye Movements | Omer Shubi et.al. | 2410.04484v1 | link |
2024-10-04 | Unraveling Cross-Modality Knowledge Conflicts in Large Vision-Language Models | Tinghui Zhu et.al. | 2410.03659v2 | link |
2024-10-04 | Self-Powered LLM Modality Expansion for Large Speech-Text Models | Tengfei Yu et.al. | 2410.03798v2 | link |
2024-10-03 | Vinoground: Scrutinizing LMMs over Dense Temporal Reasoning with Short Videos | Jianrui Zhang et.al. | 2410.02763v1 | null |
2024-10-03 | Video Instruction Tuning With Synthetic Data | Yuanhan Zhang et.al. | 2410.02713v2 | null |
2024-10-03 | LLaVA-Critic: Learning to Evaluate Multimodal Models | Tianyi Xiong et.al. | 2410.02712v1 | null |
2024-10-03 | From Pixels to Tokens: Byte-Pair Encoding on Quantized Visual Modalities | Wanpeng Zhang et.al. | 2410.02155v2 | null |
2024-10-02 | Leopard: A Vision Language Model For Text-Rich Multi-Image Tasks | Mengzhao Jia et.al. | 2410.01744v2 | link |
2024-10-01 | BabelBench: An Omni Benchmark for Code-Driven Analysis of Multimodal and Multistructured Data | Xuwu Wang et.al. | 2410.00773v1 | link |
2024-10-01 | ERASMO: Leveraging Large Language Models for Enhanced Clustering Segmentation | Fillipe dos Santos Silva et.al. | 2410.03738v1 | null |
2024-09-30 | Robin3D: Improving 3D Large Language Model via Robust Instruction Tuning | Weitai Kang et.al. | 2410.00255v1 | link |
2024-09-30 | MM1.5: Methods, Analysis & Insights from Multimodal LLM Fine-tuning | Haotian Zhang et.al. | 2409.20566v1 | null |
2024-09-30 | HELPD: Mitigating Hallucination of LVLMs by Hierarchical Feedback Learning with Vision-enhanced Penalty Decoding | Fan Yuan et.al. | 2409.20429v1 | null |
2024-09-30 | Using Large Multimodal Models to Extract Knowledge Components for Knowledge Tracing from Multimedia Question Information | Hyeongdon Moon et.al. | 2409.20167v1 | link |
2024-09-30 | Multimodal LLM Enhanced Cross-lingual Cross-modal Retrieval | Yabing Wang et.al. | 2409.19961v1 | null |