OpenDataLab

OpenDataLab provides access to numerous significant open-source datasets.

China

Pinned Repositories

DocLayout-YOLO
DocLayout-YOLO: Enhancing Document Layout Analysis through Diverse Synthetic Data and Global-to-Local Adaptive Perception
Language:Python1.8k 10 168135
LabelLLM
The Open-Source Data Annotation Platform
Language:TypeScript954 12 44105
labelU
Data annotation toolbox that supports image, audio, and video data.
Language:Python1.4k 17 89154
magic-doc
Language:Python543 11 3646
magic-html
Language:Python495 9 1942
MinerU
Transforms complex documents like PDFs into LLM-ready markdown/JSON for your Agentic workflows.
Language:Python48.3k 199 1.9k4k
OmniDocBench
[CVPR 2025] A Comprehensive Benchmark for Document Parsing and Evaluation
Language:Python1.1k 12 120109
PDF-Extract-Kit
A Comprehensive Toolkit for High-Quality PDF Content Extraction
Language:Python8.9k 58 176671
UniMERNet
UniMERNet: A Universal Network for Real-World Mathematical Expression Recognition
Language:Python424 8 6733
WanJuan1.0
WanJuan 1.0 multimodal corpus
567 7 2928

OpenDataLab's Repositories

opendatalab/MinerU
Transforms complex documents like PDFs into LLM-ready markdown/JSON for your Agentic workflows.
Language:Python48.3k 199 1.9k4k
opendatalab/DocLayout-YOLO
DocLayout-YOLO: Enhancing Document Layout Analysis through Diverse Synthetic Data and Global-to-Local Adaptive Perception
Language:Python1.8k 10 168135
opendatalab/labelU
Data annotation toolbox that supports image, audio, and video data.
Language:Python1.4k 17 89154
opendatalab/OmniDocBench
[CVPR 2025] A Comprehensive Benchmark for Document Parsing and Evaluation
Language:Python1.1k 12 120109
opendatalab/LabelLLM
The Open-Source Data Annotation Platform
Language:TypeScript954 12 44105
opendatalab/magic-html
Language:Python495 9 1942
opendatalab/UniMERNet
UniMERNet: A Universal Network for Real-World Mathematical Expression Recognition
Language:Python424 8 6733
opendatalab/Meta-rater
[ACL 2025 Best Theme Paper] This is the official implementation for the paper: "Meta-rater: A Multi-dimensional Data Selection Method for Pre-training Language Models"
Language:Python18014
opendatalab/LOKI
[ICLR 2025 Spotlight] The official implementation of the paper “LOKI: A Comprehensive Synthetic Data Detection Benchmark using Large Multimodal Models”
Language:Python169 2 14
opendatalab/labelU-Kit
Data annotation component library, provided as NPM packages
Language:TypeScript133 7 1843
opendatalab/opendatalab-datasets
datasets resource
125 2 312
opendatalab/VHM
VHM: Versatile and Honest Vision Language Model for Remote Sensing Image Analysis
Language:Python106 6 137
opendatalab/OHR-Bench
(ICCV 2025) OCR Hinders RAG: Evaluating the Cascading Impact of OCR on Retrieval-Augmented Generation
Language:Python93 5 712
opendatalab/FakeVLM
[NeurIPS 2025 🔥] FakeVLM: Advancing Synthetic Image Detection through Explainable Multimodal Models and Fine-Grained Artifact Analysis
Language:Python83 2 134
opendatalab/MLS-BRN
[CVPR 2024] 3D Building Reconstruction from Monocular Remote Sensing Images with Multi-level Supervisions
Language:Python80 9 136
opendatalab/Vis3
Data browser based on S3: a visualization tool for S3-hosted data (json / jsonl / html / md, etc.).
Language:TypeScript78 3 010
opendatalab/skydiffusion
[ICCV 2025] The official implementation of the paper “Street-to-Satellite Image Synthesis with Diffusion Models and BEV Paradigm”
Language:Python76 1 35
opendatalab/LEGION
[ICCV25 Highlight] The official implementation of the paper "LEGION: Learning to Ground and Explain for Synthetic Image Detection"
Language:Python67 2 65
opendatalab/mineru-vl-utils
A Python package for interacting with the MinerU Vision-Language Model.
Language:Python65 0 314
opendatalab/opendatalab-python-sdk
SDK of OpenDataLab - https://opendatalab.org.cn
Language:Python57 2 35
opendatalab/Earth-Agent
Language:Python527
opendatalab/ProverGen
[ICLR 2025] This is the official implementation for the paper: "Large Language Models Meet Symbolic Provers for Logical Reasoning Evaluation"
Language:Python364
opendatalab/UrBench
[AAAI 2025] This repo contains evaluation code for the paper “UrBench: A Comprehensive Benchmark for Evaluating Large Multimodal Models in Multi-View Urban Scenarios”
Language:Python35 1 15
opendatalab/REST
Language:Python322
opendatalab/PM4Bench
Language:Python14 1 0
opendatalab/awesome-markdown-ebooks
122
opendatalab/OpenHuEval
Language:Python90
opendatalab/GRAIT
[NAACL25 findings] Gradient-Driven Refusal-Aware Instruction Tuning for Effective Hallucination Mitigation
Language:Python3
opendatalab/.github
1 2 02
opendatalab/opendatalab.github.io
Language:HTML