dataset
There are 12865 repositories under the dataset topic.
public-apis/public-apis
A collective list of free APIs
HumanSignal/label-studio
Label Studio is a multi-type data labeling and annotation tool with a standardized output format
joke2k/faker
Faker is a Python package that generates fake data for you.
lukas-blecher/LaTeX-OCR
pix2tex: Using a ViT to convert images of equations into LaTeX code.
cvat-ai/cvat
Annotate better with CVAT, the industry-leading data engine for machine learning. Used and trusted by teams at any scale, for data of any scale.
zalandoresearch/fashion-mnist
An MNIST-like fashion product database and benchmark.
ConardLi/easy-dataset
A powerful tool for creating fine-tuning datasets for LLMs
doccano/doccano
Open source annotation tool for machine learning practitioners.
brightmart/nlp_chinese_corpus
Large-scale Chinese corpus for NLP
satellite-image-deep-learning/techniques
Techniques for deep learning with satellite & aerial imagery
NirantK/awesome-project-ideas
Curated list of Machine Learning, NLP, Vision, Recommender Systems Project Ideas
googlecreativelab/quickdraw-dataset
Documentation on how to access and use the Quick, Draw! Dataset.
mdn/browser-compat-data
Browser compatibility data for Web technologies as displayed on MDN
lonePatient/awesome-pretrained-chinese-nlp-models
Awesome Pretrained Chinese NLP Models: a collection of high-quality Chinese pretrained models, large models, multimodal models, and large language models
SPLWare/esProc
esProc SPL is a JVM-based programming language designed for structured data computation, serving as both a data analysis tool and an embedded computing engine.
tensorflow/datasets
TFDS is a collection of datasets ready to use with TensorFlow, Jax, ...
whoiskatrin/sql-translator
SQL Translator is a tool for converting natural language queries into SQL code using artificial intelligence. This project is 100% free and open source.
CLUEbenchmark/CLUE
Chinese Language Understanding Evaluation Benchmark: datasets, baselines, pre-trained models, corpus and leaderboard
wainshine/Chinese-Names-Corpus
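Chinese names corpus and name generator. Covers Chinese full names, surnames, given names, forms of address, Japanese names, transliterated names, and English names. Useful for Chinese word segmentation and person-name entity recognition.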
rom1504/img2dataset
Easily turn large sets of image URLs into an image dataset. Can download, resize and package 100M URLs in 20h on one machine.
hyunwoongko/transformer
Transformer: PyTorch Implementation of "Attention Is All You Need"
OpenCSGs/csghub
CSGHub is a brand-new open-source platform for managing LLMs, developed by the OpenCSG team. It offers both open-source and on-premise/SaaS solutions, with features comparable to Hugging Face. Gain full control over the lifecycle of LLMs, datasets, and agents, with Python SDK compatibility with Hugging Face. Join us! ⭐️
Charmve/Surface-Defect-Detection
📈 The largest industrial defect detection database and paper collection to date. A continually updated summary of open-source datasets and key papers in the field of surface defect research.
mlabonne/llm-datasets
Curated list of datasets and tools for post-training.
Belval/TextRecognitionDataGenerator
A synthetic data generator for text recognition
pytorch/text
Models, data loaders and abstractions for language processing, powered by PyTorch
jdorfman/awesome-json-datasets
A curated list of awesome JSON datasets that don't require authentication.
linhandev/dataset
An index of medical imaging datasets
Zjh-819/LLMDataHub
A quick guide (especially) for trending instruction finetuning datasets
pydata/pandas-datareader
Extract data from a wide range of Internet sources into a pandas DataFrame.
ieee8023/covid-chestxray-dataset
We are building an open database of COVID-19 cases with chest X-ray or CT images.
waymo-research/waymo-open-dataset
Waymo Open Dataset
whylabs/whylogs
An open-source data logging library for machine learning models and data pipelines. 📚 Provides visibility into data quality & model performance over time. 🛡️ Supports privacy-preserving data collection, ensuring safety & robustness. 📈
M-3LAB/awesome-industrial-anomaly-detection
Paper list and datasets for industrial image anomaly/defect detection (continuously updated).
ashvardanian/StringZilla
Up to 100x faster strings for C, C++, CUDA, Python, Rust, Swift, JS, & Go, leveraging NEON, AVX2, AVX-512, SVE, GPGPU, & SWAR to accelerate search, hashing, sorting, edit distances, sketches, and memory ops 🦖
meodai/color-names
Large list of handpicked color names 🌈