data-cleaning

There are 2,815 repositories under the data-cleaning topic.

  • cleanlab/cleanlab

    The standard data-centric AI package for data quality and machine learning with messy, real-world data and labels.

    Language:Python8.9k85344685
  • johnkerl/miller

    Miller is like awk, sed, cut, join, and sort for name-indexed data such as CSV, TSV, and tabular JSON

    Language:Go8.7k69637204
  • voxel51/fiftyone

    The open-source tool for building high-quality datasets and computer vision models

    Language:Python6.9k531.5k512
  • unionai-oss/pandera

    A lightweight, flexible, and expressive statistical data testing library

    Language:Python3.1k18798284
  • justmarkham/pandas-videos

    Jupyter notebook and datasets from the pandas video series

    Language:Jupyter Notebook2.1k19781.9k
  • justmarkham/DAT8

    General Assembly's 2015 Data Science course in Washington, DC

    Language:Jupyter Notebook1.6k11311.1k
  • hi-primus/optimus

    🚚 Agile Data Preparation Workflows made easy with Pandas, Dask, cuDF, Dask-cuDF, Vaex and PySpark

    Language:Python1.4k38218233
  • sfirke/janitor

    Simple tools for data cleaning in R

    Language:R1.4k36401130
  • data-forge/data-forge-ts

    The JavaScript data transformation and analysis toolkit inspired by Pandas and LINQ.

    Language:TypeScript1.3k2510876
  • skrub-data/skrub

    Prepping tables for machine learning

    Language:Python1k2129490
  • ECNU-ICALK/EduChat

    An open-source Chinese-English educational dialogue large language model from ICALK, East China Normal University (general-purpose base model, GPU deployment, data cleaning). With thanks to LLaMA, MOSS, BELLE, Ziya, and vLLM.

    Language:Python626161965
  • schema-inspector/schema-inspector

    Schema-Inspector is a simple JavaScript object sanitization and validation module.

    Language:JavaScript504137745
  • akanz1/klib

    Easy to use Python library of customized functions for cleaning and analyzing data.

    Language:Python47952151
  • encord-team/encord-active

    The toolkit to test, validate, and evaluate your models and surface, curate, and prioritize the most valuable data for labeling.

    Language:Python424101223
  • data-cleaning/validate

    Professional data validation for the R environment

    Language:R4021916737
  • msamogh/nonechucks

    Deal with bad samples in your dataset dynamically, use Transforms as Filters, and more!

    Language:Python37433227
  • jim-schwoebel/voicebook

    🗣️ A book and repo to get you started programming voice computing applications in Python (10 chapters and 200+ scripts).

    Language:Python371252582
  • Desbordante/desbordante-core

    Desbordante is a high-performance data profiler that discovers many different patterns in data using various algorithms. It can also run data cleaning scenarios built on these algorithms. Desbordante has a console version and an easy-to-use web application.

    Language:C++36197162
  • rasgointelligence/feature-engineering-tutorials

    Data Science Feature Engineering and Selection Tutorials

    Language:Jupyter Notebook2689898
  • probcomp/PClean

    A domain-specific probabilistic programming language for scalable Bayesian data cleaning

    Language:Julia215222031
  • ajaymache/data-analysis-using-python

    Exploratory data analysis 📊 using Python 🐍 of a used car 🚘 database taken from Kaggle

    Language:Jupyter Notebook21314089
  • genomoncology/FuzzTypes

    Pydantic extension for annotating autocorrecting fields.

    Language:Python202502
  • BdR76/CSVLint

    CSV Lint plug-in for Notepad++: syntax highlighting, CSV validation, automatic column and datatype detection, fixed-width datasets, changing datetime formats and decimal separators, sorting data, counting unique values, and converting to XML, JSON, SQL, etc. A plug-in for data cleaning and working with messy data files.

    Language:C#1456718
  • ekstroem/dataMaid

    An R package for data screening

    Language:HTML141105926
  • jim-schwoebel/allie

    🤖 An automated machine learning framework for audio, text, image, video, or .CSV files (50+ featurizers and 15+ model trainers). Python 3.6 required.

    Language:Python13953836
  • hi-primus/bumblebee

    🚕 A spreadsheet-like data preparation web app that works over Optimus (Pandas, Dask, cuDF, Dask-cuDF, Spark and Vaex)

    Language:Vue1371211435
  • iam-mhaseeb/Skytrax-Data-Warehouse

    A full data warehouse infrastructure with ETL pipelines running inside Docker on Apache Airflow for data orchestration, AWS Redshift as the cloud data warehouse, and Metabase for data visualizations such as analytical dashboards.

    Language:Python1328026
  • CambioML/uniflow-llm-based-pdf-extraction-text-cleaning-data-clustering

    LLM-based text extraction from unstructured data such as PDFs, Word documents, and HTML pages. Transform and cluster the text into your desired format. Less information loss, more interpretation, and faster R&D!

    Language:Python12851142
  • KulikDM/pythresh

    Outlier Detection Thresholding

    Language:Jupyter Notebook118225
  • charlesdedampierre/BunkaTopics

    🗺️ Data Cleaning and Textual Data Visualization 🗺️

    Language:Python1043411
  • ChrisMuir/refinr

    Cluster and merge similar string values: an R implementation of OpenRefine clustering algorithms

    Language:C++1028155
  • Iqrar99/data-analytics-portfolio

    Portfolio of data science and data analysis projects I completed for academic, self-learning, and hobby purposes.

    Language:Jupyter Notebook876022
  • aai-institute/pyDVL

    pyDVL is a library of stable implementations of algorithms for data valuation and influence function computation

    Language:Python8543479
  • LoLei/redditcleaner

    Cleans Reddit Text Data 📜 🧹

    Language:Python79402
  • HoloClean/HoloClean-Legacy-deprecated

    A Machine Learning System for Data Enrichment.

    Language:Python752217522
  • opendataval/opendataval

    OpenDataVal: a Unified Benchmark for Data Valuation in Python (NeurIPS 2023)

    Language:Python75176