arxiv-daily

Automated deployment @ 2024-10-22 09:44:43

Add your topics and keywords in database/topic.yml. You can also view historical data in database/storage.
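Each table below carries five columns: publish date, title, authors, an arXiv identifier (in the PDF column), and a code marker. As a minimal sketch, a row could be turned into a record with an arXiv abstract URL as shown here. It assumes the rows are tab-separated as rendered in this listing and that `null` in the Code column means no code link was found. `parse_row` is a hypothetical helper for illustration, not part of this repository.

```python
import re

# Public arXiv abstract URL prefix; the identifier is appended to it.
ARXIV_ABS = "https://arxiv.org/abs/"


def parse_row(line):
    """Split one tab-separated table row into a dict (hypothetical helper)."""
    date, title, authors, pdf, code = line.rstrip("\n").split("\t")
    # Drop the trailing version suffix (e.g. "v2") to get the base identifier.
    base_id = re.sub(r"v\d+$", "", pdf)
    return {
        "date": date,
        "title": title,
        "authors": authors,
        "arxiv_id": pdf,
        "abs_url": ARXIV_ABS + base_id,
        # Assumption: "null" marks a paper with no known code link.
        "has_code": code != "null",
    }


row = parse_row(
    "2022-06-12\tGLIPv2: Unifying Localization and Vision-Language Understanding"
    "\tHaotian Zhang et.al.\t2206.05836v2\tlink"
)
print(row["abs_url"])
```

Stripping the version suffix makes the URL point at the latest revision of the paper rather than the one that was current when the table was generated.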

Multimodal

Weakly Supervised Grounding

Publish Date	Title	Authors	PDF	Code
2024-02-29	How to Understand "Support"? An Implicit-enhanced Causal Inference Approach for Weakly-supervised Phrase Grounding	Jiamin Luo et.al.	2402.19116v2	null
2024-01-19	Weakly Supervised Gaussian Contrastive Grounding with Large Multimodal Models for Video Question Answering	Haibo Wang et.al.	2401.10711v4	link
2023-12-15	Weakly-Supervised 3D Visual Grounding based on Visual Linguistic Alignment	Xiaoxu Xu et.al.	2312.09625v3	null
2023-12-07	Improved Visual Grounding through Self-Consistent Explanations	Ruozhen He et.al.	2312.04554v1	null
2023-05-18	Weakly-Supervised Visual-Textual Grounding with Semantic Prior Refinement	Davide Rigoni et.al.	2305.10913v2	link
2023-03-31	Zero-shot Referring Image Segmentation with Global-Local Context Features	Seonghoon Yu et.al.	2303.17811v2	link
2022-10-09	MAMO: Masked Multimodal Modeling for Fine-Grained Vision-Language Representation Learning	Zijia Zhao et.al.	2210.04183v3	null
2022-06-14	Beyond Grounding: Extracting Fine-Grained Event Hierarchies Across Modalities	Hammad A. Ayyubi et.al.	2206.07207v3	null
2022-04-22	Hypergraph Transformer: Weakly-supervised Multi-hop Reasoning for Knowledge-based Visual Question Answering	Yu-Jung Heo et.al.	2204.10448v1	link
2022-03-16	Pseudo-Q: Generating Pseudo Language Queries for Visual Grounding	Haojun Jiang et.al.	2203.08481v2	link
2022-02-09	Can Open Domain Question Answering Systems Answer Visual Knowledge Questions?	Jiawen Zhang et.al.	2202.04306v1	null
2021-12-01	Weakly-Supervised Video Object Grounding via Causal Intervention	Wei Wang et.al.	2112.00475v1	null
2021-09-04	Weakly Supervised Relative Spatial Reasoning for Visual Question Answering	Pratyay Banerjee et.al.	2109.01934v1	null
2020-10-12	MAF: Multimodal Alignment Framework for Weakly-Supervised Phrase Grounding	Qinxin Wang et.al.	2010.05379v1	link
2020-06-17	Contrastive Learning for Weakly Supervised Phrase Grounding	Tanmay Gupta et.al.	2006.09920v3	link
2019-12-01	Learning to Relate from Captions and Bounding Boxes	Sarthak Garg et.al.	1912.00311v1	null
2019-08-29	Aesthetic Image Captioning From Weakly-Labelled Photographs	Koustav Ghosal et.al.	1908.11310v1	null

Alignment

Publish Date	Title	Authors	PDF	Code
2024-10-18	MiCEval: Unveiling Multimodal Chain of Thought's Quality via Image Description and Reasoning Steps	Xiongtao Zhou et.al.	2410.14668v1	link
2024-10-18	Few-Shot Joint Multimodal Entity-Relation Extraction via Knowledge-Enhanced Cross-modal Prompt Model	Li Yuan et.al.	2410.14225v1	null
2024-10-17	Exploring the Design Space of Visual Context Representation in Video MLLMs	Yifan Du et.al.	2410.13694v1	link
2024-10-17	Remember, Retrieve and Generate: Understanding Infinite Visual Concepts as Your Personalized Assistant	Haoran Hao et.al.	2410.13360v1	link
2024-10-17	CLaMP 2: Multimodal Music Information Retrieval Across 101 Languages Using Large Language Models	Shangda Wu et.al.	2410.13267v1	link
2024-10-16	MMed-RAG: Versatile Multimodal RAG System for Medical Vision Language Models	Peng Xia et.al.	2410.13085v1	link
2024-10-15	OMCAT: Omni Context Aware Transformer	Arushi Goel et.al.	2410.12109v1	null
2024-10-14	MMCFND: Multimodal Multilingual Caption-aware Fake News Detection for Low-resource Indic Languages	Shubhi Bansal et.al.	2410.10407v1	null
2024-10-11	Baichuan-Omni Technical Report	Yadong Li et.al.	2410.08565v1	link
2024-10-10	InstructBioMol: Advancing Biomolecule Understanding and Design Following Human Instructions	Xiang Zhuang et.al.	2410.07919v1	null
2024-10-09	Do better language models have crisper vision?	Jona Ruthardt et.al.	2410.07173v1	null
2024-10-09	ETA: Evaluating Then Aligning Safety of Vision Language Models at Inference Time	Yi Ding et.al.	2410.06625v1	link
2024-10-08	LLaCA: Multimodal Large Language Continual Assistant	Jingyang Qiao et.al.	2410.10868v1	null
2024-10-08	Multimodal Large Language Models and Tunings: Vision, Language, Sensors, Audio, and Beyond	Soyeon Caren Han et.al.	2410.05608v1	null
2024-10-03	LLaVA-Critic: Learning to Evaluate Multimodal Models	Tianyi Xiong et.al.	2410.02712v1	null
2024-10-03	From Pixels to Tokens: Byte-Pair Encoding on Quantized Visual Modalities	Wanpeng Zhang et.al.	2410.02155v2	null
2024-09-30	The age of spiritual machines: Language quietus induces synthetic altered states of consciousness in artificial intelligence	Jeremy I Skipper et.al.	2410.00257v1	null
2024-09-30	Multimodal LLM Enhanced Cross-lingual Cross-modal Retrieval	Yabing Wang et.al.	2409.19961v1	null
2024-09-29	A multimodal LLM for the non-invasive decoding of spoken text from brain recordings	Youssef Hmamouche et.al.	2409.19710v1	null
2024-09-27	Show and Guide: Instructional-Plan Grounded Vision and Language Model	Diogo Glória-Silva et.al.	2409.19074v3	link
2024-09-27	CLLMate: A Multimodal LLM for Weather and Climate Events Forecasting	Haobo Li et.al.	2409.19058v1	null
2024-09-26	MMMT-IF: A Challenging Multimodal Multi-Turn Instruction Following Benchmark	Elliot L. Epstein et.al.	2409.18216v1	null
2024-09-26	MIO: A Foundation Model on Multimodal Tokens	Zekun Wang et.al.	2409.17692v1	null
2024-09-26	ZALM3: Zero-Shot Enhancement of Vision-Language Alignment via In-Context Information in Multi-Turn Multimodal Medical Dialogue	Zhangpu Li et.al.	2409.17610v1	null
2024-09-24	M$^2$PT: Multimodal Prompt Tuning for Zero-shot Instruction Learning	Taowen Wang et.al.	2409.15657v3	link
2024-09-20	MaPPER: Multimodal Prior-guided Parameter Efficient Tuning for Referring Expression Comprehension	Ting Liu et.al.	2409.13609v2	null
2024-09-20	AVG-LLaVA: A Large Multimodal Model with Adaptive Visual Granularity	Zhibin Lan et.al.	2410.02745v2	link
2024-09-20	ChemDFM-X: Towards Large Multimodal Model for Chemistry	Zihan Zhao et.al.	2409.13194v1	null
2024-09-18	Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution	Peng Wang et.al.	2409.12191v2	link
2024-09-17	CoCA: Regaining Safety-awareness of Multimodal Large Language Models with Constitutional Calibration	Jiahui Gao et.al.	2409.11365v2	null
2024-09-16	Model-in-the-Loop (MILO): Accelerating Multimodal AI Data Annotation with LLMs	Yifan Wang et.al.	2409.10702v2	null
2024-09-16	Quantile Regression for Distributional Reward Models in RLHF	Nicolai Dorka et.al.	2409.10164v1	link
2024-09-14	Constructive Approach to Bidirectional Causation between Qualia Structure and Language Emergence	Tadahiro Taniguchi et.al.	2409.09413v1	null
2024-09-14	IW-Bench: Evaluating Large Multimodal Models for Converting Image-to-Web	Hongcheng Guo et.al.	2409.18980v1	null
2024-09-14	From Text to Multimodality: Exploring the Evolution and Impact of Large Language Models in Medical Practice	Qian Niu et.al.	2410.01812v2	null
2024-09-11	What to align in multimodal contrastive learning?	Benoit Dufumier et.al.	2409.07402v1	null
2024-09-09	MLLM-FL: Multimodal Large Language Model Assisted Federated Learning on Heterogeneous and Long-tailed Data	Jianyi Zhang et.al.	2409.06067v1	null
2024-09-05	ChartMoE: Mixture of Expert Connector for Advanced Chart Understanding	Zhengzhuo Xu et.al.	2409.03277v1	null
2024-08-30	MultiMath: Bridging Visual and Mathematical Reasoning for Large Language Models	Shuai Peng et.al.	2409.00147v1	link
2024-08-23	The Ultimate Guide to Fine-Tuning LLMs from Basics to Breakthroughs: An Exhaustive Review of Technologies, Research, Best Practices, Applied Research Challenges and Opportunities	Venkatesh Balavadhani Parthasarathy et.al.	2408.13296v1	null
2024-08-23	IAA: Inner-Adaptor Architecture Empowers Frozen Large Language Model with Multimodal Capabilities	Bin Wang et.al.	2408.12902v1	link
2024-08-19	Personalizing Reinforcement Learning from Human Feedback with Variational Preference Learning	Sriyash Poddar et.al.	2408.10075v1	null
2024-08-16	Math-PUMA: Progressive Upward Multimodal Alignment to Enhance Mathematical Reasoning	Wenwen Zhuang et.al.	2408.08640v2	link
2024-08-13	CROME: Cross-Modal Adapters for Efficient Multimodal LLM	Sayna Ebrahimi et.al.	2408.06610v1	null
2024-08-11	HateSieve: A Contrastive Learning Framework for Detecting and Segmenting Hateful Content in Multimodal Memes	Xuanyu Su et.al.	2408.05794v1	null
2024-08-11	VQ-CTAP: Cross-Modal Fine-Grained Sequence Representation Learning for Speech Processing	Chunyu Qiang et.al.	2408.05758v1	null
2024-08-09	VITA: Towards Open-Source Interactive Omni Multimodal LLM	Chaoyou Fu et.al.	2408.05211v2	link
2024-08-01	Mitigating Multilingual Hallucination in Large Vision-Language Models	Xiaoye Qu et.al.	2408.00550v1	link
2024-07-31	Cross-modality Information Check for Detecting Jailbreaking in Multimodal Large Language Models	Yue Xu et.al.	2407.21659v4	link
2024-07-29	BRIDGE: Bridging Gaps in Image Captioning Evaluation with Stronger Visual Cues	Sara Sarto et.al.	2407.20341v1	link
2024-07-28	LLAVADI: What Matters For Multimodal Large Language Models Distillation	Shilin Xu et.al.	2407.19409v1	null
2024-07-26	Creating an Aligned Corpus of Sound and Text: The Multimodal Corpus of Shakespeare and Milton	Manex Agirrezabal et.al.	2407.18730v1	null
2024-07-26	Every Part Matters: Integrity Verification of Scientific Figures Based on Multimodal Large Language Models	Xiang Shi et.al.	2407.18626v1	link

Computer Vision

OVD

Publish Date	Title	Authors	PDF	Code
2024-09-24	HA-FGOVD: Highlighting Fine-grained Attributes via Explicit Linear Composition for Open-Vocabulary Object Detection	Yuqi Ma et.al.	2409.16136v1	null
2024-04-03	ALOHa: A New Measure for Hallucination in Captioning Models	Suzanne Petryk et.al.	2404.02904v1	null
2024-03-21	Scene-Graph ViT: End-to-End Open-Vocabulary Visual Relationship Detection	Tim Salzmann et.al.	2403.14270v2	null
2024-03-11	Real-time Transformer-based Open-Vocabulary Detection with Efficient Fusion Head	Tiancheng Zhao et.al.	2403.06892v1	link
2023-08-25	How to Evaluate the Generalization of Detection? A Benchmark for Comprehensive Open-Vocabulary Detection	Yiyang Yao et.al.	2308.13177v2	link
2023-05-11	Region-Aware Pretraining for Open-Vocabulary Object Detection with Vision Transformers	Dahun Kim et.al.	2305.07011v4	link
2023-04-10	Prompt Pre-Training with Twenty-Thousand Classes for Open-Vocabulary Visual Recognition	Shuhuai Ren et.al.	2304.04704v2	link
2023-03-29	MaMMUT: A Simple Architecture for Joint Learning for MultiModal Tasks	Weicheng Kuo et.al.	2303.16839v3	null
2023-03-17	Investigating the Role of Attribute Context in Vision-Language Models for Object Recognition and Detection	Kyle Buettner et.al.	2303.10093v2	null
2022-09-10	OmDet: Large-scale vision-language multi-dataset pre-training with multimodal detection network	Tiancheng Zhao et.al.	2209.05946v2	link
2022-06-12	GLIPv2: Unifying Localization and Vision-Language Understanding	Haotian Zhang et.al.	2206.05836v2	link

LMM

Publish Date	Title	Authors	PDF	Code
2024-10-18	MiCEval: Unveiling Multimodal Chain of Thought's Quality via Image Description and Reasoning Steps	Xiongtao Zhou et.al.	2410.14668v1	link
2024-10-18	Few-Shot Joint Multimodal Entity-Relation Extraction via Knowledge-Enhanced Cross-modal Prompt Model	Li Yuan et.al.	2410.14225v1	null
2024-10-18	MultiChartQA: Benchmarking Vision-Language Models on Multi-Chart Problems	Zifeng Zhu et.al.	2410.14179v1	null
2024-10-18	Utilizing Large Language Models for Event Deconstruction to Enhance Multimodal Aspect-Based Sentiment Analysis	Xiaoyong Huang et.al.	2410.14150v1	null
2024-10-18	Coherence-Driven Multimodal Safety Dialogue with Active Learning for Embodied Agents	Sabit Hassan et.al.	2410.14141v1	null
2024-10-17	Generating Signed Language Instructions in Large-Scale Dialogue Systems	Mert İnan et.al.	2410.14026v1	null
2024-10-17	Can MLLMs Understand the Deep Implication Behind Chinese Images?	Chenhao Zhang et.al.	2410.13854v1	link
2024-10-17	Retrospective Learning from Interactions	Zizhao Chen et.al.	2410.13852v1	null
2024-10-17	Harnessing Webpage UIs for Text-Rich Visual Understanding	Junpeng Liu et.al.	2410.13824v2	null
2024-10-17	MobA: A Two-Level Agent System for Efficient Mobile Task Automation	Zichen Zhu et.al.	2410.13757v1	null
2024-10-17	Exploring the Design Space of Visual Context Representation in Video MLLMs	Yifan Du et.al.	2410.13694v1	link
2024-10-17	Remember, Retrieve and Generate: Understanding Infinite Visual Concepts as Your Personalized Assistant	Haoran Hao et.al.	2410.13360v1	link
2024-10-17	Representation Learning of Structured Data for Medical Foundation Models	Vijay Prakash Dwivedi et.al.	2410.13351v1	null
2024-10-17	CLaMP 2: Multimodal Music Information Retrieval Across 101 Languages Using Large Language Models	Shangda Wu et.al.	2410.13267v1	link
2024-10-16	MMed-RAG: Versatile Multimodal RAG System for Medical Vision Language Models	Peng Xia et.al.	2410.13085v1	link
2024-10-16	WorldMedQA-V: a multilingual, multimodal medical examination dataset for multimodal language models evaluation	João Matos et.al.	2410.12722v1	link
2024-10-16	Prompt Compression for Large Language Models: A Survey	Zongqian Li et.al.	2410.12388v2	link
2024-10-16	Understanding the Role of LLMs in Multimodal Evaluation Benchmarks	Botian Jiang et.al.	2410.12329v1	null
2024-10-15	OMCAT: Omni Context Aware Transformer	Arushi Goel et.al.	2410.12109v1	null
2024-10-15	MLLM can see? Dynamic Correction Decoding for Hallucination Mitigation	Chenxi Wang et.al.	2410.11779v1	link
2024-10-15	Magnifier Prompt: Tackling Multimodal Hallucination via Extremely Simple Instructions	Yuhan Fu et.al.	2410.11701v1	null
2024-10-15	VidEgoThink: Assessing Egocentric Video Understanding Capabilities for Embodied AI	Sijie Cheng et.al.	2410.11623v1	null
2024-10-15	MultiVENT 2.0: A Massive Multilingual Benchmark for Event-Centric Video Retrieval	Reno Kriz et.al.	2410.11619v1	null
2024-10-15	Difficult Task Yes but Simple Task No: Unveiling the Laziness in Multimodal LLMs	Sihang Zhao et.al.	2410.11437v1	link
2024-10-14	Generative AI and Its Impact on Personalized Intelligent Tutoring Systems	Subhankar Maity et.al.	2410.10650v1	null
2024-10-14	MMIE: Massive Multimodal Interleaved Comprehension Benchmark for Large Vision-Language Models	Peng Xia et.al.	2410.10139v1	link
2024-10-13	Empowering Dysarthric Speech: Leveraging Advanced LLMs for Accurate Speech Correction and Multimodal Emotion Analysis	Kaushal Attaluri et.al.	2410.12867v1	null
2024-10-13	BlackDAN: A Black-Box Multi-Objective Approach for Effective and Contextual Jailbreaking of Large Language Models	Xinyuan Wang et.al.	2410.09804v2	null
2024-10-13	ECIS-VQG: Generation of Entity-centric Information-seeking Questions from Videos	Arpan Phukan et.al.	2410.09776v1	link
2024-10-12	Reconstructive Visual Instruction Tuning	Haochen Wang et.al.	2410.09575v1	null
2024-10-12	Declarative Knowledge Distillation from Large Language Models for Visual Question Answering Datasets	Thomas Eiter et.al.	2410.09428v1	link
2024-10-11	M3Hop-CoT: Misogynous Meme Identification with Multimodal Multi-hop Chain-of-Thought	Gitanjali Kumari et.al.	2410.09220v1	link
2024-10-11	A Social Context-aware Graph-based Multimodal Attentive Learning Framework for Disaster Content Classification during Emergencies	Shahid Shafi Dar et.al.	2410.08814v1	null
2024-10-11	Baichuan-Omni Technical Report	Yadong Li et.al.	2410.08565v1	link
2024-10-11	SPORTU: A Comprehensive Sports Understanding Benchmark for Multimodal Large Language Models	Haotian Xia et.al.	2410.08474v2	null
2024-10-10	LatteCLIP: Unsupervised CLIP Fine-Tuning via LMM-Synthetic Texts	Anh-Quan Cao et.al.	2410.08211v1	null
2024-10-10	Mono-InternVL: Pushing the Boundaries of Monolithic Multimodal Large Language Models with Endogenous Visual Pre-training	Gen Luo et.al.	2410.08202v1	null
2024-10-10	MRAG-Bench: Vision-Centric Evaluation for Retrieval-Augmented Multimodal Models	Wenbo Hu et.al.	2410.08182v1	null
2024-10-10	Sample then Identify: A General Framework for Risk Control and Assessment in Multimodal Large Language Models	Qingni Wang et.al.	2410.08174v1	null
2024-10-10	Agent S: An Open Agentic Framework that Uses Computers Like a Human	Saaket Agashe et.al.	2410.08164v1	link
2024-10-10	Insight Over Sight? Exploring the Vision-Knowledge Conflicts in Multimodal LLMs	Xiaoyuan Liu et.al.	2410.08145v1	null
2024-10-10	InstructBioMol: Advancing Biomolecule Understanding and Design Following Human Instructions	Xiang Zhuang et.al.	2410.07919v1	null
2024-10-10	How Does Vision-Language Adaptation Impact the Safety of Vision Language Models?	Seongyun Lee et.al.	2410.07571v1	null
2024-10-10	Thought2Text: Text Generation from EEG Signal using Large Language Models (LLMs)	Abhijit Mishra et.al.	2410.07507v1	link
2024-10-09	Do better language models have crisper vision?	Jona Ruthardt et.al.	2410.07173v1	null
2024-10-09	To Preserve or To Compress: An In-Depth Study of Connector Selection in Multimodal Large Language Models	Junyan Lin et.al.	2410.06765v1	link
2024-10-09	Enhancing Multimodal LLM for Detailed and Accurate Video Captioning using Multi-Round Preference Optimization	Changli Tang et.al.	2410.06682v2	null
2024-10-09	ING-VP: MLLMs cannot Play Easy Vision-based Games Yet	Haoran Zhang et.al.	2410.06555v1	link
2024-10-09	Chip-Tuning: Classify Before Language Models Say	Fangwei Zhu et.al.	2410.06541v2	link
2024-10-08	Multimodal Situational Safety	Kaiwen Zhou et.al.	2410.06172v1	null
2024-10-08	PDF-WuKong: A Large Multimodal Model for Efficient Long PDF Reading with End-to-End Sparse Sampling	Xudong Xie et.al.	2410.05970v1	link
2024-10-08	LLaCA: Multimodal Large Language Continual Assistant	Jingyang Qiao et.al.	2410.10868v1	null
2024-10-08	Multimodal Large Language Models and Tunings: Vision, Language, Sensors, Audio, and Beyond	Soyeon Caren Han et.al.	2410.05608v1	null
2024-10-07	Navigating the Digital World as Humans Do: Universal Visual Grounding for GUI Agents	Boyu Gou et.al.	2410.05243v1	link
2024-10-07	MINER: Mining the Underlying Pattern of Modality-Specific Neurons in Multimodal Large Language Models	Kaichen Huang et.al.	2410.04819v1	link
2024-10-07	TLDR: Token-Level Detective Reward Model for Large Vision Language Models	Deqing Fu et.al.	2410.04734v1	null
2024-10-06	LRQ-Fact: LLM-Generated Relevant Questions for Multimodal Fact-Checking	Alimohammad Beigi et.al.	2410.04616v1	null
2024-10-06	CogDevelop2K: Reversed Cognitive Development in Multimodal Large Language Models	Yijiang Li et.al.	2410.10855v1	null
2024-10-06	FAMMA: A Benchmark for Financial Domain Multilingual Multimodal Question Answering	Siqiao Xue et.al.	2410.04526v2	null
2024-10-06	ErrorRadar: Benchmarking Complex Mathematical Reasoning of Multimodal Large Language Models Via Error Detection	Yibo Yan et.al.	2410.04509v2	null
2024-10-06	Fine-Grained Prediction of Reading Comprehension from Eye Movements	Omer Shubi et.al.	2410.04484v1	link
2024-10-04	Unraveling Cross-Modality Knowledge Conflicts in Large Vision-Language Models	Tinghui Zhu et.al.	2410.03659v2	link
2024-10-04	Self-Powered LLM Modality Expansion for Large Speech-Text Models	Tengfei Yu et.al.	2410.03798v2	link
2024-10-03	Vinoground: Scrutinizing LMMs over Dense Temporal Reasoning with Short Videos	Jianrui Zhang et.al.	2410.02763v1	null
2024-10-03	Video Instruction Tuning With Synthetic Data	Yuanhan Zhang et.al.	2410.02713v2	null
2024-10-03	LLaVA-Critic: Learning to Evaluate Multimodal Models	Tianyi Xiong et.al.	2410.02712v1	null
2024-10-03	From Pixels to Tokens: Byte-Pair Encoding on Quantized Visual Modalities	Wanpeng Zhang et.al.	2410.02155v2	null
2024-10-02	Leopard: A Vision Language Model For Text-Rich Multi-Image Tasks	Mengzhao Jia et.al.	2410.01744v2	link
2024-10-01	BabelBench: An Omni Benchmark for Code-Driven Analysis of Multimodal and Multistructured Data	Xuwu Wang et.al.	2410.00773v1	link
2024-10-01	ERASMO: Leveraging Large Language Models for Enhanced Clustering Segmentation	Fillipe dos Santos Silva et.al.	2410.03738v1	null
2024-09-30	Robin3D: Improving 3D Large Language Model via Robust Instruction Tuning	Weitai Kang et.al.	2410.00255v1	link
2024-09-30	MM1.5: Methods, Analysis & Insights from Multimodal LLM Fine-tuning	Haotian Zhang et.al.	2409.20566v1	null
2024-09-30	HELPD: Mitigating Hallucination of LVLMs by Hierarchical Feedback Learning with Vision-enhanced Penalty Decoding	Fan Yuan et.al.	2409.20429v1	null
2024-09-30	Using Large Multimodal Models to Extract Knowledge Components for Knowledge Tracing from Multimedia Question Information	Hyeongdon Moon et.al.	2409.20167v1	link
2024-09-30	Multimodal LLM Enhanced Cross-lingual Cross-modal Retrieval	Yabing Wang et.al.	2409.19961v1	null
