data-curation
There are 62 repositories under the data-curation topic.
cleanlab/cleanlab
The standard data-centric AI package for data quality and machine learning with messy, real-world data and labels.
voxel51/fiftyone
The open-source tool for building high-quality datasets and computer vision models
Docta-ai/docta
A Doctor for your data
visual-layer/fastdup
fastdup is a powerful free tool designed to rapidly extract valuable insights from your image & video datasets. It helps you improve the quality of your dataset's images & labels and reduce your data operations costs at scale.
Renumics/spotlight
Interactively explore unstructured datasets from your dataframe.
daochenzha/data-centric-AI
A curated, but incomplete, list of data-centric AI resources.
Renumics/awesome-open-data-centric-ai
Curated list of open source tooling for data-centric AI on unstructured data.
getmetamapper/metamapper
Metamapper is a data discovery and documentation platform for improving how teams understand and interact with their data.
Renumics/sliceguard
A library for detecting problematic data segments in structured and unstructured data with a few lines of code.
LaureBerti/Learn2Clean
Learn2Clean: Optimizing the Sequence of Tasks for Data Preparation and Cleaning
whythawk/data-as-a-science
Lesson guide and textbook for the "Data as a Science" course.
x-CK-x/Dataset-Curation-Tool
A tool for downloading from public image boards (those that allow scraping), previewing your images & tags, and editing them. Additional tabs let you download other code repositories as well as state-of-the-art diffusion and auto-tag/caption models. Custom datasets can be added!
iwangjian/TopDial
Code and data for "Target-oriented Proactive Dialogue Systems with Personalization: Problem Formulation and Dataset Curation" (EMNLP 2023)
cleanlab/cleanlab-studio
Client interface for all things Cleanlab Studio
neo-chem/awesome-chemical-data
Curated list of known efforts in collecting and/or curating chemical/materials data
PennLINC/CuBIDS
Curation of BIDS (CuBIDS): A sanity-preserving software package for processing BIDS datasets.
Digital-Dermatology/SelfClean
🧼🔎 A holistic self-supervised data cleaning strategy to detect irrelevant samples, near duplicates and label errors.
mcsorkun/AqSolDB
AqSolDB: A curated aqueous solubility dataset containing 9,982 unique compounds.
Grelot/global_fish_genetic_diversity
Code I wrote for the paper "Global determinants of freshwater and marine fish genetic diversity" (Nature Communications, 2020)
VIDA-NYU/openclean-core
Data Cleaning and Data Profiling Library for Python
thehyve/tmtk
tranSMART Arborist ETL toolkit
TieuLongPhan/SynRBL
Rebalancing chemical reaction
UHBristolDataScience/ICNARC-to-Philips-Linkage
Code for data linkage (curation of research database).
ARUP-CAS/aiscr-webamcr
Archaeological Map of the Czech Republic (Archeologická mapa České republiky, AMČR)
cgnorthcutt/reliablity_framework_for_rag
Demo showing how the Trustworthy Language Model adds reliability to LLM outputs and improves RAG, agents, and data enrichment workflows. It can be used to improve fine-tuning of LLMs, accuracy of LLM outputs, and smart routing for RAG and agents.
johannesuhl/hisdac-es
HISDAC-ES: Creating historical settlement data for Spain (1900-2020) based on cadastral building footprint data
ARUP-CAS/aiscr-digiarchiv-2
Digital Archive of the AMČR (Digitální archiv AMČR)
halbritter-lab/gene-curator
Gene Curator is an open-source platform for managing and curating genetic data. It facilitates gene data analysis, entry, and reporting, serving genetics researchers with tools for efficient data handling.
PR-Desai2226/Web-Scraping
Web scraping and text data collection and curation for the Maithili language, plus language modeling on the collected data.
thehyve/arborist
TranSMART Arborist: Graphical tool for reshaping your data for the tranSMART data warehouse.
UAL-RE/ldcoolp-figshare
Python tool using the Figshare API for data curation
voxel51/fiftyone_mlflow_plugin
Track model training experiments with MLflow and FiftyOne!
Henrium/ET-AL
Entropy-targeted active learning for bias mitigation in materials data.
Manuelkila/Data-curation
A task for the Hamoye Stage E internship on data curation, with a focus on the Amazon Book website.
yago-mendoza/suskind-knowledge-graph
Graph-based NLP framework leveraging a curated database and an intuitive CLI for advanced, context-rich language understanding.