data-curation

There are 91 repositories under the data-curation topic.

  • cleanlab/cleanlab

    Cleanlab's open-source library is the standard data-centric AI package for data quality and machine learning with messy, real-world data and labels.

    Language:Python10.9k88374850
  • voxel51/fiftyone

    Refine high-quality datasets and visual AI models

    Language:Python9.9k671.7k665
  • Docta-ai/docta

    A Doctor for your data

    Language:Python3.5k1343231
  • visual-layer/fastdup

    fastdup is a powerful, free tool designed to rapidly generate valuable insights from image and video datasets. It helps enhance the quality of both images and labels, while significantly reducing data operation costs, all with unmatched scalability.

    Language:Python1.7k2326384
  • Renumics/spotlight

    Interactively explore unstructured datasets from your dataframe.

    Language:TypeScript1.2k189387
  • NVIDIA-NeMo/Curator

    Scalable data preprocessing and curation toolkit for LLMs

    Language:Python1.1k18259174
  • daochenzha/data-centric-AI

    A curated, but incomplete, list of data-centric AI resources.

  • Renumics/awesome-open-data-centric-ai

    Curated list of open source tooling for data-centric AI on unstructured data.

  • UCSC-REAL/DS2

    [ICLR 2025] Official implementation of paper "Improving Data Efficiency via Curating LLM-Driven Rating Systems"

    Language:Python96517
  • getmetamapper/metamapper

    Metamapper is a data discovery and documentation platform for improving how teams understand and interact with their data.

    Language:Python80526
  • Renumics/sliceguard

    A library for detecting problematic data segments in structured and unstructured data with a few lines of code.

    Language:Python64523
  • LaureBerti/Learn2Clean

    Learn2Clean: Optimizing the Sequence of Tasks for Data Preparation and Cleaning

    Language:Python512520
  • whythawk/data-as-a-science

    Lesson guide and textbook for "Data as a Science" course.

    Language:Jupyter Notebook478199
  • x-CK-x/Dataset-Curation-Tool

    A tool for downloading from public image boards (those that allow scraping), previewing your images and tags, and editing them. Additional tabs let you download other code repositories as well as S.O.T.A. diffusion and auto-tag/caption models. Custom datasets can be added!

    Language:Python382497
  • Digital-Dermatology/SelfClean

    🧼🔎 A holistic self-supervised data cleaning strategy to detect irrelevant samples, near duplicates and label errors (NeurIPS'24).

    Language:Python36431
  • cleanlab/cleanlab-studio

    Client interface to Cleanlab Studio

    Language:Python3251810
  • brainlife/ezbids

    A web service for semi-automated conversion of raw imaging data to BIDS

    Language:Vue3136418
  • iwangjian/TopDial

    Target-oriented Proactive Dialogue Systems with Personalization: Problem Formulation and Dataset Curation (EMNLP 2023)

    Language:Python30311
  • PennLINC/CuBIDS

    Curation of BIDS (CuBIDS): A sanity-preserving software package for processing BIDS datasets.

    Language:Python28525412
  • TieuLongPhan/SynRBL

    Rebalancing chemical reaction

    Language:Python251112
  • mcsorkun/AqSolDB

    AqSolDB: A curated aqueous solubility dataset containing 9,982 unique compounds.

    Language:Python23102
  • neo-chem/awesome-chemical-data

    Curated list of known efforts in collecting and/or curating chemical/materials data

  • MigoXLab/awesome-data-quality

    A comprehensive collection of data quality resources, tools, papers, and projects across various data types including traditional data, LLM pretraining/fine-tuning data, multimodal data, and more. Essential reference for researchers and practitioners in data-centric AI.

  • BEXIS2/Core

    This is the public code repository of the BEXIS2 data management software. It contains only modules, components, and packages of the core system. Contributed modules and components will be available in separate repositories. For more information on BEXIS2, please visit our website.

    Language:JavaScript17151.6k14
  • pg-space/panspace

    Embedding-based indexing for compact storage, rapid querying, and curation of bacterial pan-genomes

    Language:Jupyter Notebook11210
  • VIDA-NYU/openclean-core

    Data Cleaning and Data Profiling Library for Python

    Language:Python115483
  • Grelot/global_fish_genetic_diversity

    Code I wrote for the paper "Global determinants of freshwater and marine fish genetic diversity", Nature Communications, 2020

    Language:R9100
  • Academich/reagent_emb_vis

    Reaction data exploration: a map of reagents with regions of similar reagent purpose.

    Language:Python7121
  • johannesuhl/hisdac-es

    HISDAC-ES: Creating historical settlement data for Spain (1900-2020) based on cadastral building footprint data

    Language:Python6301
  • thehyve/tmtk

    tranSMART Arborist ETL toolkit

    Language:Python610194
  • UHBristolDataScience/ICNARC-to-Philips-Linkage

    Code for data linkage (curation of research database).

    Language:Jupyter Notebook6203
  • ARUP-CAS/aiscr-webamcr

    Archaeological Map of the Czech Republic (AMCR)

    Language:Python551.1k0
  • cgnorthcutt/reliablity_framework_for_rag

    Demo showing how the Trustworthy Language Model adds reliability to LLM outputs and improves RAG, agents, and data enrichment workflows. It can be used to improve fine-tuning of LLMs, accuracy of LLM outputs, and smart routing for RAG and agents.

    Language:Jupyter Notebook5202
  • halbritter-lab/gene-curator

    Gene Curator is an open-source platform for managing and curating genetic data. It facilitates gene data analysis, entry, and reporting, serving genetics researchers with tools for efficient data handling.

    Language:Vue531141
  • voxel51/fiftyone_mlflow_plugin

    Track model training experiments with MLflow and FiftyOne!

    Language:Python5140