data-profiling

There are 88 repositories under the data-profiling topic.

ydataai/ydata-profiling
1 Line of code data quality profiling & exploratory data analysis for Pandas and Spark DataFrames.
Language: Python 13.1k 150 8481.7k
cleanlab/cleanlab
Cleanlab's open-source library is the standard data-centric AI package for data quality and machine learning with messy, real-world data and labels.
Language: Python 10.9k 88 374850
great-expectations/great_expectations
Always know what to expect from your data.
Language: Python 10.8k 83 2k 1.6k
open-metadata/OpenMetadata
OpenMetadata is a unified metadata platform for data discovery, data observability, and data governance powered by a central metadata repository, in-depth column level lineage, and seamless team collaboration.
Language: TypeScript 7.5k 50 8.5k 1.4k
fbdesignpro/sweetviz
Visualize and compare datasets, target values and associations, with one line of code.
Language: Python 3k 53 143288
sodadata/soda-core
Data quality testing for the modern data stack (SQL, Spark, and Pandas) https://www.soda.io
Language: Python 2.2k 13 395242
hi-primus/optimus
Agile Data Preparation Workflows made easy with Pandas, Dask, cuDF, Dask-cuDF, Vaex and PySpark
Language: Python 1.5k 36 219233
opendatadiscovery/odd-platform
First open-source data discovery and observability platform. We make life easy for data practitioners so you can focus on your business.
Language: Java 1.3k 18 645128
cleanlab/cleanvision
Automatically find issues in image datasets and practice data-centric computer vision.
Language: Python 1.1k 16 8575
datavane/datavines
Know your data better! Datavines is a next-gen data observability platform that supports metadata management and data quality.
Language: Java 667 13 201185
polyaxon/traceml
Engine for ML/Data tracking, visualization, explainability, drift detection, and dashboards for Polyaxon.
Language: Python 520 12 1446
ing-bank/popmon
Monitor the stability of a Pandas or Spark dataframe ⚙︎
Language: Python 505 13 5436
InfuseAI/piperider
Code review for data in dbt
Language: Python 490 12 7524
polyaxon/haupt
Lineage metadata API, artifacts streams, sandbox, API, and spaces for Polyaxon
Language: Python 452 35 0210
Desbordante/desbordante-core
Desbordante is a high-performance data profiler that can discover many different patterns in data using various algorithms. It also supports running data cleaning scenarios with these algorithms. Desbordante has a console version and an easy-to-use web application.
Language: C++ 420 9 7980
databrickslabs/dqx
Databricks framework to validate Data Quality of pySpark DataFrames
Language: Python 313 7 18860
dqops/dqo
Data Quality and Observability platform for the whole data lifecycle, from profiling new data sources to full automation with Data Observability. Configure data quality checks from the UI or in YAML files, and let DQOps run the checks daily to detect data quality issues.
Language: Java 165 9 1335
hi-primus/bumblebee
🚕 A spreadsheet-like data preparation web app that works over Optimus (Pandas, Dask, cuDF, Dask-cuDF, Spark and Vaex)
Language: Vue 141 11 11435
DataKitchen/data-observability-installer
Installer for DataKitchen's Open Source Data Observability Products. Data breaks. Servers break. Your toolchain breaks. Ensure your team is the first to know and the first to solve with visibility across and down your data estate. Save time with simple, fast data quality test generation and execution. Trust your data, tools, and systems end to end.
Language: Python 124 4 1712
SJTU-DMTai/awesome-ml-data-quality-papers
Papers about training data quality management for ML models.
97 2 16
Swiple/swiple
Swiple enables you to easily observe, understand, validate and improve the quality of your data
Language: Python 84 2 311
psebenick/data-profiling
A set of scripts to pull metadata and data profiling metrics from relational database systems.
Language: Python 77 7 119
apicrafter/metacrafter
Metadata and data identification tool and Python library. Identifies PII, common identifiers, and language-specific identifiers. Fully customizable and flexible rules.
Language: Python 45 3 275
opendatadiscovery/odd-collector
Open-source metadata collector based on ODD Specification
Language: Python 44 3 7813
VIDA-NYU/auctus
Dataset search engine, discovering data from a variety of sources, profiling it, and allowing advanced queries on the index
Language: Python 44 6 09
cleanlab/cleanlab-studio
Client interface to Cleanlab Studio
Language: Python 32 5 1810
tsegall/fta
Metadata/data identification Java library. Identifies Semantic Type information (e.g. Gender, Age, Color, Country,...). Extensive country/language support. Extensible via user-defined plugins. Comprehensive Profiling support.
Language: Java 30 5 643
dm4ml/gate
Drift detection module for machine learning pipelines.
Language: Python 25 2 82
ismaildawoodjee/GreatEx
A project for exploring how Great Expectations can be used to ensure data quality and validate batches within a data pipeline defined in Airflow.
Language: Python 21 4 27
raymon-ai/raymon
The official http://raymon.ai data profiling and logging library.
Language: Python 18 3 611
baligoyem/dataqtor
🔍Your Data Quality Detector / Gain insight into your data and get it ready for use before you start working with it 💡📊🛠💎
Language: Python 16 1 28
open-metadata/openmetadata-site
Open Standard for Metadata. A Single place to Discover, Collaborate and Get your data right.
Language: TypeScript 16 4 813
ahmadassaf/roomba
A Node.js tool to examine the correctness of Open Data Metadata and build custom dataset profiles
Language: JavaScript 12 4 153
CoDS-GCS/kglids
Linked Data Science powered by Knowledge Graphs
Language: Python 11 3 111
LieseB-1746743/data-cleaning
Data cleaning tool.
Language: JavaScript 9 1 04
SebastianSchmidl/distod
DISTOD algorithm: Distributed discovery of bidirectional order dependencies
Language: Scala 9 1 02
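
For readers new to the topic, the kind of per-column summary that many of the profilers listed above produce can be sketched in plain Python. This is only an illustration using the standard library, not the API of any repository on this page; the function names `profile_column` and `profile_table` and the sample rows are hypothetical.

```python
from collections import Counter
from typing import Any, Dict, List, Optional


def profile_column(values: List[Optional[Any]]) -> Dict[str, Any]:
    """Summarize one column: counts, missing values, distinct values, mode."""
    present = [v for v in values if v is not None]  # None marks a missing cell
    summary: Dict[str, Any] = {
        "count": len(values),
        "missing": len(values) - len(present),
        "distinct": len(set(present)),
        "most_common": Counter(present).most_common(1)[0][0] if present else None,
    }
    # Add numeric statistics only when every present value is a number.
    numeric = [
        v for v in present
        if isinstance(v, (int, float)) and not isinstance(v, bool)
    ]
    if present and len(numeric) == len(present):
        summary["min"] = min(numeric)
        summary["max"] = max(numeric)
        summary["mean"] = sum(numeric) / len(numeric)
    return summary


def profile_table(rows: List[Dict[str, Any]]) -> Dict[str, Dict[str, Any]]:
    """Profile every column found in a list of row dictionaries."""
    columns = sorted({key for row in rows for key in row})
    return {col: profile_column([row.get(col) for row in rows]) for col in columns}


# Hypothetical sample data, for illustration only.
rows = [
    {"city": "Oslo", "temp_c": 4.0},
    {"city": "Lima", "temp_c": 19.5},
    {"city": "Oslo", "temp_c": None},
]
report = profile_table(rows)
for column, stats in report.items():
    print(column, stats)
```

The listed tools go well beyond this sketch (type inference, correlations, drift and validation rules, HTML reports), so treat it as a description of the basic idea rather than of any one project.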
