data-quality
There are 324 repositories under the data-quality topic.
GokuMohandas/Made-With-ML
Learn how to design, develop, deploy and iterate on production-grade ML applications.
eugeneyan/applied-ml
Papers & tech blogs by companies sharing their work on data science & machine learning in production.
ydataai/ydata-profiling
1 Line of code data quality profiling & exploratory data analysis for Pandas and Spark DataFrames.
great-expectations/great_expectations
Always know what to expect from your data.
cleanlab/cleanlab
The standard data-centric AI package for data quality and machine learning with messy, real-world data and labels.
voxel51/fiftyone
Refine high-quality datasets and visual AI models
feast-dev/feast
The Open Source Feature Store for Machine Learning
open-metadata/OpenMetadata
OpenMetadata is a unified metadata platform for data discovery, data observability, and data governance powered by a central metadata repository, in-depth column level lineage, and seamless team collaboration.
evidentlyai/evidently
Evidently is an open-source ML and LLM observability framework. Evaluate, test, and monitor any AI-powered system or data pipeline. From tabular data to Gen AI. 100+ metrics.
treeverse/lakeFS
lakeFS - Data version control for your data lake | Git for data
GokuMohandas/mlops-course
Learn how to design, develop, deploy and iterate on production-grade ML applications.
datafold/data-diff
Compare tables within or across databases
whylabs/whylogs
An open-source data logging library for machine learning models and data pipelines. Provides visibility into data quality & model performance over time. Supports privacy-preserving data collection, ensuring safety & robustness.
feathr-ai/feathr
Feathr - A scalable, unified data and AI engineering platform for enterprise
sodadata/soda-core
Data quality testing for the modern data stack (SQL, Spark, and Pandas) https://www.soda.io
featureform/featureform
The Virtual Feature Store. Turn your existing data infrastructure into a feature store.
re-data/re-data
re_data - fix data issues before your users & CEO would discover them
opendatadiscovery/odd-platform
First open-source data discovery and observability platform. We make life easy for data practitioners so you can focus on your business.
daochenzha/data-centric-AI
A curated, but incomplete, list of data-centric AI resources.
cleanlab/cleanvision
Automatically find issues in image datasets and practice data-centric computer vision.
rstudio/pointblank
Data quality assessment and metadata reporting for data frames and database tables
opendatadiscovery/awesome-data-catalogs
Awesome Data Catalogs and Observability Platforms.
kennethleungty/Failed-ML
Compilation of high-profile real-world examples of failed machine learning projects
WeBankFinTech/Qualitis
Qualitis is a one-stop data quality management platform that supports quality verification, notification, and management for various data sources. It is used to solve various data quality problems caused by data processing. https://github.com/WeBankFinTech/Qualitis
NVIDIA/NeMo-Curator
Scalable data preprocessing and curation toolkit for LLMs
polyaxon/traceml
Engine for ML/Data tracking, visualization, explainability, drift detection, and dashboards for Polyaxon.
InfuseAI/piperider
Code review for data in dbt
encord-team/encord-active
The toolkit to test, validate, and evaluate your models and surface, curate, and prioritize the most valuable data for labeling.
bitol-io/open-data-contract-standard
Home of the Open Data Contract Standard (ODCS).
Data-Centric-AI-Community/awesome-data-centric-ai
Open-Source Software, Tutorials, and Research on Data-Centric AI
data-drift/data-drift
Metrics Observability & Troubleshooting
alibaba/feathub
FeatHub - A stream-batch unified feature store for real-time machine learning
ubisoft/mobydq
Tool to automate data quality checks on data pipelines
adidas/lakehouse-engine
The Lakehouse Engine is a configuration driven Spark framework, written in Python, serving as a scalable and distributed engine for several lakehouse algorithms, data flows and utilities for Data Products.
frederick0329/TracIn
Implementation of Estimating Training Data Influence by Tracing Gradient Descent (NeurIPS 2020)
GAIR-NLP/ProX
Official Repo for "Programming Every Example: Lifting Pre-training Data Quality Like Experts at Scale"