data-cleaning

There are 4,484 repositories under the data-cleaning topic.

  • cleanlab/cleanlab

    Cleanlab's open-source library is the standard data-centric AI package for data quality and machine learning with messy, real-world data and labels.

    Language:Python10.9k88374850
  • voxel51/fiftyone

    Refine high-quality datasets and visual AI models

    Language:Python9.9k671.7k665
  • johnkerl/miller

    Miller is like awk, sed, cut, join, and sort for name-indexed data such as CSV, TSV, and tabular JSON

    Language:Go9.5k69675227
  • unionai-oss/pandera

    A light-weight, flexible, and expressive statistical data testing library

    Language:Python4k201k358
  • justmarkham/pandas-videos

    Jupyter notebook and datasets from the pandas video series

    Language:Jupyter Notebook2.2k19881.9k
  • justmarkham/DAT8

    General Assembly's 2015 Data Science course in Washington, DC

    Language:Jupyter Notebook1.6k11111.1k
  • hi-primus/optimus

    🚚 Agile Data Preparation Workflows made easy with Pandas, Dask, cuDF, Dask-cuDF, Vaex and PySpark

    Language:Python1.5k36218233
  • skrub-data/skrub

    Machine learning with dataframes

    Language:Python1.5k20403156
  • sfirke/janitor

    simple tools for data cleaning in R

    Language:R1.4k35410130
  • data-forge/data-forge-ts

    The JavaScript data transformation and analysis toolkit inspired by Pandas and LINQ.

    Language:TypeScript1.4k2411377
  • OpenDCAI/DataFlow

    Easy Data Preparation with latest LLMs-based Operators and Pipelines.

    Language:Python1.3k163785
  • ECNU-ICALK/EduChat

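    An open-source educational chat model from ICALK, East China Normal University. An open-source Chinese-English large language model for educational dialogue (general-purpose base model, GPU deployment, data cleaning). Acknowledgments: LLaMA, MOSS, BELLE, Ziya, vLLM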

    Language:Jupyter Notebook845192496
  • akanz1/klib

    Easy to use Python library of customized functions for cleaning and analyzing data.

    Language:Python52342655
  • schema-inspector/schema-inspector

    Schema-Inspector is a simple JavaScript object sanitization and validation module.

    Language:JavaScript503117745
  • encord-team/encord-active

    The toolkit to test, validate, and evaluate your models and surface, curate, and prioritize the most valuable data for labeling.

    Language:Python453101426
  • data-cleaning/validate

    Professional data validation for the R environment

    Language:R4251718242
  • Desbordante/desbordante-core

    Desbordante is a high-performance data profiler that can discover many different patterns in data using various algorithms. It can also run data cleaning scenarios built on these algorithms. Desbordante has a console version and an easy-to-use web application.

    Language:C++41997980
  • jim-schwoebel/voicebook

    🗣️ A book and repo to get you started programming voice computing applications in Python (10 chapters and 200+ scripts).

    Language:Python386252587
  • msamogh/nonechucks

    Deal with bad samples in your dataset dynamically, use Transforms as Filters, and more!

    Language:Python37923227
  • DataWithBaraa/sql-data-warehouse-project

    A comprehensive guide to building a modern data warehouse with SQL Server, including ETL processes, data modeling, and analytics.

    Language:TSQL3232099
  • rasgointelligence/feature-engineering-tutorials

    Data Science Feature Engineering and Selection Tutorials

    Language:Jupyter Notebook28688100
  • ajaymache/data-analysis-using-python

    Exploratory data analysis 📊 using Python 🐍 of a used car 🚘 database taken from Kaggle

    Language:Jupyter Notebook227120101
  • probcomp/PClean

    A domain-specific probabilistic programming language for scalable Bayesian data cleaning

    Language:Julia227212033
  • CambioML/uniflow-llm-based-pdf-extraction-text-cleaning-data-clustering

    LLM-based text extraction from unstructured data such as PDF, Word, and HTML files. Transform and cluster the text into your desired format. Less information loss, more interpretation, and faster R&D!

    Language:Python22441363
  • genomoncology/FuzzTypes

    Pydantic extension for annotating autocorrecting fields.

    Language:Python222504
  • BdR76/CSVLint

    CSV Lint plug-in for Notepad++ for syntax highlighting, csv validation, automatic column and datatype detecting, fixed width datasets, change datetime format, decimal separator, sort data, count unique values, convert to xml, json, sql etc. A plugin for data cleaning and working with messy data files.

    Language:C#20747916
  • charlesdedampierre/BunkaTopics

    🗺️ Data Cleaning and Textual Data Visualization 🗺️

    Language:Python1874817
  • jim-schwoebel/allie

    🤖 An automated machine learning framework for audio, text, image, video, or .CSV files (50+ featurizers and 15+ model trainers). Python 3.6 required.

    Language:Python14743835
  • ekstroem/dataMaid

    An R package for data screening

    Language:HTML14395926
  • hi-primus/bumblebee

    🚕 A spreadsheet-like data preparation web app that works over Optimus (Pandas, Dask, cuDF, Dask-cuDF, Spark and Vaex)

    Language:Vue1411111435
  • Hi-Dolphin/datamax

    A powerful multi-format file parsing, data cleaning, and AI annotation toolkit.

    Language:Python140
  • iam-mhaseeb/Skytrax-Data-Warehouse

    A full data warehouse infrastructure with ETL pipelines running inside Docker, using Apache Airflow for data orchestration, AWS Redshift as the cloud data warehouse, and Metabase for data visualizations such as analytical dashboards.

    Language:Python1388030
  • aai-institute/pyDVL

    pyDVL is a library of stable implementations of algorithms for data valuation and influence function computation

    Language:Python13663837
  • KulikDM/pythresh

    Outlier Detection Thresholding

    Language:Jupyter Notebook136145
  • xShaimaa/Data-Analysis-Projects

    Data analysis practice projects covering cleaning, visualization, and EDA on different datasets using Python, SQL, Power BI, etc.

    Language:Jupyter Notebook1315014
  • Iqrar99/data-analytics-portfolio

    Portfolio of data science and data analysis projects I completed for academic, self-learning, and hobby purposes.

    Language:Jupyter Notebook1136026