data-profiling

There are 88 repositories under the data-profiling topic.

  • ydataai/ydata-profiling

    1 Line of code data quality profiling & exploratory data analysis for Pandas and Spark DataFrames.

    Language:Python13.1k1508481.7k
  • cleanlab/cleanlab

    Cleanlab's open-source library is the standard data-centric AI package for data quality and machine learning with messy, real-world data and labels.

    Language:Python10.9k88374850
  • great-expectations/great_expectations

    Always know what to expect from your data.

    Language:Python10.8k832k1.6k
  • open-metadata/OpenMetadata

    OpenMetadata is a unified metadata platform for data discovery, data observability, and data governance powered by a central metadata repository, in-depth column level lineage, and seamless team collaboration.

    Language:TypeScript7.5k508.5k1.4k
  • fbdesignpro/sweetviz

    Visualize and compare datasets, target values and associations, with one line of code.

    Language:Python3k53143288
  • sodadata/soda-core

    ⚡ Data quality testing for the modern data stack (SQL, Spark, and Pandas) https://www.soda.io

    Language:Python2.2k13395242
  • hi-primus/optimus

    🚚 Agile Data Preparation Workflows made easy with Pandas, Dask, cuDF, Dask-cuDF, Vaex and PySpark

    Language:Python1.5k36219233
  • opendatadiscovery/odd-platform

    First open-source data discovery and observability platform. We make life easy for data practitioners so you can focus on your business.

    Language:Java1.3k18645128
  • cleanlab/cleanvision

    Automatically find issues in image datasets and practice data-centric computer vision.

    Language:Python1.1k168575
  • datavane/datavines

    Know your data better! Datavines is a next-gen data observability platform that supports metadata management and data quality.

    Language:Java66713201185
  • polyaxon/traceml

    Engine for ML/Data tracking, visualization, explainability, drift detection, and dashboards for Polyaxon.

    Language:Python520121446
  • ing-bank/popmon

    Monitor the stability of a Pandas or Spark dataframe ⚙︎

    Language:Python505135436
  • InfuseAI/piperider

    Code review for data in dbt

    Language:Python490127524
  • polyaxon/haupt

    Lineage metadata API, artifacts streams, sandbox, API, and spaces for Polyaxon

    Language:Python452350210
  • Desbordante/desbordante-core

    Desbordante is a high-performance data profiler that can discover many different patterns in data using various algorithms. It also lets you run data cleaning scenarios using these algorithms. Desbordante has a console version and an easy-to-use web application.

    Language:C++42097980
  • databrickslabs/dqx

    Databricks framework to validate the data quality of PySpark DataFrames

    Language:Python313718860
  • dqops/dqo

    Data Quality and Observability platform for the whole data lifecycle, from profiling new data sources to full automation with Data Observability. Configure data quality checks from the UI or in YAML files, and let DQOps run them daily to detect data quality issues.

    Language:Java16591335
  • hi-primus/bumblebee

    🚕 A spreadsheet-like data preparation web app that works over Optimus (Pandas, Dask, cuDF, Dask-cuDF, Spark and Vaex)

    Language:Vue1411111435
  • DataKitchen/data-observability-installer

    Installer for DataKitchen's Open Source Data Observability Products. Data breaks. Servers break. Your toolchain breaks. Ensure your team is the first to know and the first to solve with visibility across and down your data estate. Save time with simple, fast data quality test generation and execution. Trust your data, tools, and systems end to end.

    Language:Python12441712
  • SJTU-DMTai/awesome-ml-data-quality-papers

    Papers about training data quality management for ML models.

  • Swiple/swiple

    Swiple enables you to easily observe, understand, validate, and improve the quality of your data.

    Language:Python842311
  • psebenick/data-profiling

    A set of scripts to pull metadata and data profiling metrics from relational database systems.

    Language:Python777119
  • apicrafter/metacrafter

    Metadata and data identification tool and Python library. Identifies PII, common identifiers, and language-specific identifiers. Fully customizable and flexible rules.

    Language:Python453275
  • opendatadiscovery/odd-collector

    Open-source metadata collector based on ODD Specification

    Language:Python4437813
  • VIDA-NYU/auctus

    Dataset search engine, discovering data from a variety of sources, profiling it, and allowing advanced queries on the index

    Language:Python44609
  • cleanlab/cleanlab-studio

    Client interface to Cleanlab Studio

    Language:Python3251810
  • tsegall/fta

    Metadata/data identification Java library. Identifies Semantic Type information (e.g. Gender, Age, Color, Country,...). Extensive country/language support. Extensible via user-defined plugins. Comprehensive Profiling support.

    Language:Java305643
  • dm4ml/gate

    Drift detection module for machine learning pipelines.

    Language:Python25282
  • ismaildawoodjee/GreatEx

    A project for exploring how Great Expectations can be used to ensure data quality and validate batches within a data pipeline defined in Airflow.

    Language:Python21427
  • raymon-ai/raymon

    The official http://raymon.ai data profiling and logging library.

    Language:Python183611
  • baligoyem/dataqtor

    🔍Your Data Quality Detector / Gain insight into your data and get it ready for use before you start working with it 💡📊🛠💎

    Language:Python16128
  • open-metadata/openmetadata-site

    Open Standard for Metadata. A Single place to Discover, Collaborate and Get your data right.

    Language:TypeScript164813
  • ahmadassaf/roomba

    A Node.js tool to examine the correctness of Open Data Metadata and build custom dataset profiles

    Language:JavaScript124153
  • CoDS-GCS/kglids

    Linked Data Science powered by Knowledge Graphs

    Language:Python113111
  • LieseB-1746743/data-cleaning

    Data cleaning tool.

    Language:JavaScript9104
  • SebastianSchmidl/distod

    DISTOD algorithm: Distributed discovery of bidirectional order dependencies

    Language:Scala9102