may0324's Stars
wangwen-whu/WTW-Dataset
This is an official implementation for the WTW Dataset in "Parsing Table Structures in the Wild " on table detection and table structure recognition.
harrytea/Awesome-Document-Understanding
Document Artifical Intelligence
cv-small-snails/Awesome-Table-Recognition
A curated list of resources dedicated to table recognition
ucaslcl/Fox
official code for "Fox: Focus Anywhere for Fine-grained Multi-page Document Understanding"
llavar/MMR_Bench
Ucas-HaoranWei/GOT-OCR2.0
Official code implementation of General OCR Theory: Towards OCR-2.0 via a Unified End-to-end Model
snap-research/Panda-70M
[CVPR 2024] Panda-70M: Captioning 70M Videos with Multiple Cross-Modality Teachers
Sanster/xy-cut
Yuliang-Liu/MultimodalOCR
On the Hidden Mystery of OCR in Large Multimodal Models (OCRBench)
modelscope/ms-swift
Use PEFT or Full-parameter to finetune 400+ LLMs or 100+ MLLMs. (LLM: Qwen2.5, Llama3.2, GLM4, Internlm2.5, Yi1.5, Mistral, Baichuan2, DeepSeek, Gemma2, ...; MLLM: Qwen2-VL, Qwen2-Audio, Llama3.2-Vision, Llava, InternVL2, MiniCPM-V-2.6, GLM4v, Xcomposer2.5, Yi-VL, DeepSeek-VL, Phi3.5-Vision, ...)
X-PLUG/mPLUG-DocOwl
mPLUG-DocOwl: Modularized Multimodal Large Language Model for Document Understanding
X-PLUG/mPLUG-Owl
mPLUG-Owl: The Powerful Multi-modal Large Language Model Family
Breakthrough/PySceneDetect
:movie_camera: Python and OpenCV-based scene cut/transition detection program & library.
mst272/LLM-Dojo
欢迎来到 LLM-Dojo,这里是一个开源大模型学习场所,使用简洁且易阅读的代码构建模型训练框架(支持各种主流模型如Qwen、Llama、GLM等等)、RLHF框架(DPO/CPO/KTO/PPO)等各种功能。👩🎓👨🎓
meta-llama/llama
Inference code for Llama models
bitsandbytes-foundation/bitsandbytes
Accessible large language models via k-bit quantization for PyTorch.
Vision-CAIR/MiniGPT4-video
Official code for Goldfish model for long video understanding and MiniGPT4-video for short video understanding
Vision-CAIR/MiniGPT-4
Open-sourced codes for MiniGPT-4 and MiniGPT-v2 (https://minigpt-4.github.io, https://minigpt-v2.github.io/)
IDEA-Research/Grounded-Segment-Anything
Grounded SAM: Marrying Grounding DINO with Segment Anything & Stable Diffusion & Recognize Anything - Automatically Detect , Segment and Generate Anything
IDEA-Research/GroundingDINO
[ECCV 2024] Official implementation of the paper "Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection"
showlab/VLog
Transform Video as a Document with ChatGPT, CLIP, BLIP2, GRIT, Whisper, LangChain.
showlab/UniVTG
[ICCV2023] UniVTG: Towards Unified Video-Language Temporal Grounding
Albertsr/Anomaly-Detection
UnSupervised and Semi-Supervise Anomaly Detection / IsolationForest / KernelPCA Detection / ADOA / etc.
comfyanonymous/ComfyUI
The most powerful and modular diffusion model GUI, api and backend with a graph/nodes interface.
weijiawu/BOVText-Benchmark
[NeurIPS2021] BOVText: A Large-Scale, Multidimensional Multilingual Dataset for Video Text Spotting
weijiawu/TransVTSpotter
A new video text spotting framework with Transformer
LukeForeverYoung/UReader
TencentARC/VTLayout
yuxie11/R2D2
salesforce/LAVIS
LAVIS - A One-stop Library for Language-Vision Intelligence