data-curation
There are 91 repositories under the data-curation topic.
cleanlab/cleanlab
Cleanlab's open-source library is the standard data-centric AI package for data quality and machine learning with messy, real-world data and labels.
voxel51/fiftyone
Refine high-quality datasets and visual AI models
Docta-ai/docta
A Doctor for your data
visual-layer/fastdup
fastdup is a powerful, free tool designed to rapidly generate valuable insights from image and video datasets. It helps enhance the quality of both images and labels, while significantly reducing data operation costs, all with unmatched scalability.
Renumics/spotlight
Interactively explore unstructured datasets from your dataframe.
NVIDIA-NeMo/Curator
Scalable data preprocessing and curation toolkit for LLMs
daochenzha/data-centric-AI
A curated, but incomplete, list of data-centric AI resources.
Renumics/awesome-open-data-centric-ai
Curated list of open source tooling for data-centric AI on unstructured data.
UCSC-REAL/DS2
[ICLR 2025] Official implementation of the paper "Improving Data Efficiency via Curating LLM-Driven Rating Systems"
getmetamapper/metamapper
Metamapper is a data discovery and documentation platform for improving how teams understand and interact with their data.
Renumics/sliceguard
A library for detecting problematic data segments in structured and unstructured data with a few lines of code.
LaureBerti/Learn2Clean
Learn2Clean: Optimizing the Sequence of Tasks for Data Preparation and Cleaning
whythawk/data-as-a-science
Lesson guide and textbook for the "Data as a Science" course.
x-CK-x/Dataset-Curation-Tool
A tool for downloading from public image boards (those that allow scraping), previewing your images and tags, and editing them. Additional tabs let you download other code repositories as well as state-of-the-art diffusion and auto-tag/caption models. Custom datasets can be added!
Digital-Dermatology/SelfClean
🧼🔎 A holistic self-supervised data cleaning strategy to detect irrelevant samples, near duplicates and label errors (NeurIPS'24).
cleanlab/cleanlab-studio
Client interface to Cleanlab Studio
brainlife/ezbids
A web service for semi-automated conversion of raw imaging data to BIDS
iwangjian/TopDial
Target-oriented Proactive Dialogue Systems with Personalization: Problem Formulation and Dataset Curation (EMNLP 2023)
PennLINC/CuBIDS
Curation of BIDS (CuBIDS): A sanity-preserving software package for processing BIDS datasets.
TieuLongPhan/SynRBL
Rebalancing chemical reactions
mcsorkun/AqSolDB
AqSolDB: A curated aqueous solubility dataset containing 9,982 unique compounds.
neo-chem/awesome-chemical-data
Curated list of known efforts in collecting and/or curating chemical/materials data
MigoXLab/awesome-data-quality
A comprehensive collection of data quality resources, tools, papers, and projects across various data types including traditional data, LLM pretraining/fine-tuning data, multimodal data, and more. Essential reference for researchers and practitioners in data-centric AI.
BEXIS2/Core
This is the public code repository of the BEXIS2 data management software. It contains only modules, components, and packages of the core system. Contributed modules and components will be available in separate repositories. For more information on BEXIS2, please visit our website.
pg-space/panspace
Embedding-based indexing for compact storage, rapid querying, and curation of bacterial pan-genomes
VIDA-NYU/openclean-core
Data Cleaning and Data Profiling Library for Python
Grelot/global_fish_genetic_diversity
Code I wrote for the paper "Global determinants of freshwater and marine fish genetic diversity" (Nature Communications, 2020)
Academich/reagent_emb_vis
Reaction data exploration: a map of reagents with regions of similar reagent purpose.
johannesuhl/hisdac-es
HISDAC-ES: Creating historical settlement data for Spain (1900-2020) based on cadastral building footprint data
thehyve/tmtk
tranSMART Arborist ETL toolkit
UHBristolDataScience/ICNARC-to-Philips-Linkage
Code for data linkage (curation of research database).
ARUP-CAS/aiscr-webamcr
Archaeological Map of the Czech Republic (AMCR)
cgnorthcutt/reliablity_framework_for_rag
Demo showing how the Trustworthy Language Model adds reliability to LLM outputs and improves RAG, agents, and data enrichment workflows. It can be used to improve fine-tuning of LLMs, the accuracy of LLM outputs, and smart routing for RAG and agents.
halbritter-lab/gene-curator
Gene Curator is an open-source platform for managing and curating genetic data. It facilitates gene data analysis, entry, and reporting, serving genetics researchers with tools for efficient data handling.
voxel51/fiftyone_mlflow_plugin
Track model training experiments with MLflow and FiftyOne!