Pinned Repositories
ADE-dataset
A first dataset for determining the alignment relation between visual and textual elements in diagrams, aimed at diagram understanding. We will release the dataset and code after the paper is accepted.
InternLM-XComposer
InternLM-XComposer2.5-OmniLive: A Comprehensive Multimodal System for Long-term Streaming Video and Audio Interactions
TabRecSet
A large-scale dataset of camera-taken images for table detection and recognition.
ms-swift
Use PEFT or Full-parameter to finetune 400+ LLMs (Qwen2.5, Llama3.2, GLM4, Internlm2.5, Yi1.5, Mistral, Baichuan2, DeepSeek, ...) or 100+ MLLMs (Qwen2-VL, Qwen2-Audio, Llama3.2-Vision, Llava, InternVL2.5, MiniCPM-V-2.6, GLM4v, Xcomposer2.5, Yi-VL, DeepSeek-VL2, Phi3.5-Vision, GOT-OCR2, ...).
CuMo
CuMo: Scaling Multimodal LLM with Co-Upcycled Mixture-of-Experts
surya
OCR, layout analysis, reading order, table recognition in 90+ languages
vqaonline.github.io
Marcovaldon's Repositories
Marcovaldon doesn’t have any repositories yet.