dataset
There are 10,981 repositories under the dataset topic.
public-apis/public-apis
A collective list of free APIs
HumanSignal/label-studio
Label Studio is a multi-type data labeling and annotation tool with a standardized output format.
joke2k/faker
Faker is a Python package that generates fake data for you.
lukas-blecher/LaTeX-OCR
pix2tex: Using a ViT to convert images of equations into LaTeX code.
cvat-ai/cvat
Annotate better with CVAT, the industry-leading data engine for machine learning. Used and trusted by teams at any scale, for data of any scale.
zalandoresearch/fashion-mnist
An MNIST-like fashion product database with benchmarks.
doccano/doccano
Open source annotation tool for machine learning practitioners.
brightmart/nlp_chinese_corpus
Large-scale Chinese corpus for NLP
satellite-image-deep-learning/techniques
Techniques for deep learning with satellite & aerial imagery
NirantK/awesome-project-ideas
Curated list of Machine Learning, NLP, Vision, Recommender Systems Project Ideas
googlecreativelab/quickdraw-dataset
Documentation on how to access and use the Quick, Draw! Dataset.
mdn/browser-compat-data
This repository contains compatibility data for Web technologies as displayed on MDN
lonePatient/awesome-pretrained-chinese-nlp-models
Awesome Pretrained Chinese NLP Models: a collection of high-quality Chinese pretrained models, large models, multimodal models, and large language models
SPLWare/esProc
esProc SPL is a scripting language for data processing with a rich, well-designed function library and powerful syntax. It can be executed from a Java program through a JDBC interface or run independently.
tensorflow/datasets
TFDS is a collection of datasets ready to use with TensorFlow, Jax, ...
whoiskatrin/sql-translator
SQL Translator is a tool for converting natural language queries into SQL code using artificial intelligence. This project is 100% free and open source.
CLUEbenchmark/CLUE
Chinese Language Understanding Evaluation Benchmark: datasets, baselines, pre-trained models, corpus and leaderboard
wainshine/Chinese-Names-Corpus
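Chinese names corpus and name generator. Covers Chinese full names, surnames, given names, forms of address, Japanese names, transliterated names, and English names. Useful for Chinese word segmentation and person-name entity recognition.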
rom1504/img2dataset
Easily turn large sets of image URLs into an image dataset. Can download, resize and package 100M URLs in 20h on one machine.
pytorch/text
Models, data loaders and abstractions for language processing, powered by PyTorch
jdorfman/awesome-json-datasets
A curated list of awesome JSON datasets that don't require authentication.
Belval/TextRecognitionDataGenerator
A synthetic data generator for text recognition
Charmve/Surface-Defect-Detection
📈 Currently the largest collection of industrial defect detection datasets and papers. Continuously summarizes open-source datasets and key papers in the field of surface defect research.
hyunwoongko/transformer
Transformer: PyTorch Implementation of "Attention Is All You Need"
modelscope/data-juicer
Making data higher-quality, juicier, and more digestible for foundation models! 🍎 🍋 🌽 ➡️ ➡️🍸 🍹 🍷
ieee8023/covid-chestxray-dataset
We are building an open database of COVID-19 cases with chest X-ray or CT images.
pydata/pandas-datareader
Extract data from a wide range of Internet sources into a pandas DataFrame.
waymo-research/waymo-open-dataset
Waymo Open Dataset
linhandev/dataset
An Index for Medical Imaging Datasets
Zjh-819/LLMDataHub
A quick guide (especially) for trending instruction finetuning datasets
whylabs/whylogs
An open-source data logging library for machine learning models and data pipelines. 📚 Provides visibility into data quality & model performance over time. 🛡️ Supports privacy-preserving data collection, ensuring safety & robustness. 📈
GeorgeSeif/Semantic-Segmentation-Suite
Semantic Segmentation Suite in TensorFlow. Implement, train, and test new Semantic Segmentation models easily!
unsplash/datasets
🎁 5,400,000+ Unsplash images made available for research and machine learning
meodai/color-names
Large list of handpicked color names 🌈
luban-agi/Awesome-Domain-LLM
A curated and organized collection of open-source models, datasets, and evaluation benchmarks for vertical domains.
ashvardanian/StringZilla
Up to 10x faster strings for C, C++, Python, Rust, and Swift, leveraging NEON, AVX2, AVX-512, and SWAR to accelerate search, sort, edit distances, alignment scores, etc 🦖