data-centric-ai
There are 79 repositories under the data-centric-ai topic.
cleanlab/cleanlab
Cleanlab's open-source library is the standard data-centric AI package for data quality and machine learning with messy, real-world data and labels.
voxel51/fiftyone
Refine high-quality datasets and visual AI models
Docta-ai/docta
A Doctor for your data
code-kern-ai/refinery
The data scientist's open-source choice to scale, assess and maintain natural language data. Treat training data like a software artifact.
Renumics/spotlight
Interactively explore unstructured datasets from your dataframe.
HazyResearch/data-centric-ai
Resources for Data Centric AI
daochenzha/data-centric-AI
A curated, but incomplete, list of data-centric AI resources.
cleanlab/cleanvision
Automatically find issues in image datasets and practice data-centric computer vision.
Renumics/awesome-open-data-centric-ai
Curated list of open source tooling for data-centric AI on unstructured data.
dcai-course/dcai-lab
Lab assignments for Introduction to Data-Centric AI, MIT IAP 2024
gszfwsb/NCFM
Official PyTorch implementation of the paper "Dataset Distillation with Neural Characteristic Function: A Minmax Perspective" (NCFM) in CVPR 2025 (Highlight).
GAIR-NLP/ProX
[ICML 2025] Programming Every Example: Lifting Pre-training Data Quality Like Experts at Scale
JieyuZ2/wrench
[NeurIPS 2021] WRENCH: Weak supeRvision bENCHmark
yueyu1030/AttrPrompt
[NeurIPS 2023] This is the code for the paper "Large Language Model as Attributed Training Data Generator: A Tale of Diversity and Bias".
aai-institute/pyDVL
pyDVL is a library of stable implementations of algorithms for data valuation and influence function computation
dcai-course/dcai-course
Introduction to Data-Centric AI, MIT IAP 2024
opendataval/opendataval
OpenDataVal: a Unified Benchmark for Data Valuation in Python (NeurIPS 2023)
SJTU-DMTai/awesome-ml-data-quality-papers
Papers about training data quality management for ML models.
OFA-Sys/DiverseEvol
Self-Evolved Diverse Data Sampling for Efficient Instruction Tuning
NextBrain-ai/nbsynthetic
nbsynthetic is a simple and robust tabular synthetic data generation library for small and medium-sized datasets
TonyLianLong/UnsupervisedSelectiveLabeling
[ECCV 2022] Official Implementation for Unsupervised Selective Labeling for More Effective Semi-Supervised Learning
astutic/Acharya
A Data Centric NER annotation tool for your Named Entity Recognition projects
koalazf99/Awesome-DataCentric-LLM
Trending projects & awesome papers about data-centric LLM studies.
luo-junyu/Awesome-Data-Efficient-LLM
A list of data-efficient and data-centric LLM (Large Language Model) papers. Our Survey Paper: Towards Efficient LLM Post Training: A Data-centric Perspective
awesome-mlops/awesome-data-management
A curated list of awesome open source tools and commercial products to catalog, version, and manage data
Digital-Dermatology/SelfClean
A holistic self-supervised data cleaning strategy to detect irrelevant samples, near duplicates and label errors (NeurIPS'24).
KibromBerihu/ai4elife
This data-centric AI repository implements a robust deep learning method (LFBNet) for fully automated tumor segmentation in whole-body [18]F-FDG PET/CT images.
sail-sg/D-TRAK
Intriguing Properties of Data Attribution on Diffusion Models (ICLR 2024)
cleanlab/cleanlab-studio
Client interface to Cleanlab Studio
nachifur/LLPC
Frontiers in Neuroinformatics 2022: Local Label Point Correction for Edge Detection of Overlapping Cervical Cells
ear-team/bambird
Unsupervised classification to improve the quality of a bird song recording dataset. https://doi.org/10.1016/j.ecoinf.2022.101952
voxel51/reconstruction-error-ratios
Estimate dataset difficulty and detect label mistakes using reconstruction error ratios!
Lichang-Chen/AlpaGasus
A better Alpaca Model Trained with Less Data (only 9k instructions of the original set)
IS2AI/AnyFace
Input-Agnostic Face Detection
kennethleungty/Data-Centric-AI-Competition
Code for a Top 5% finish in the Data-Centric AI Competition organized by Andrew Ng and DeepLearning.AI
autonlab/aqua
AQuA: A Benchmarking Tool for Label Quality Assessment, NeurIPS'23 D&B