data-cleaning
There are 2815 repositories under the data-cleaning topic.
cleanlab/cleanlab
The standard data-centric AI package for data quality and machine learning with messy, real-world data and labels.
johnkerl/miller
Miller is like awk, sed, cut, join, and sort for name-indexed data such as CSV, TSV, and tabular JSON
voxel51/fiftyone
The open-source tool for building high-quality datasets and computer vision models
unionai-oss/pandera
A light-weight, flexible, and expressive statistical data testing library
justmarkham/pandas-videos
Jupyter notebook and datasets from the pandas video series
justmarkham/DAT8
General Assembly's 2015 Data Science course in Washington, DC
hi-primus/optimus
🚚 Agile Data Preparation Workflows made easy with Pandas, Dask, cuDF, Dask-cuDF, Vaex and PySpark
sfirke/janitor
simple tools for data cleaning in R
data-forge/data-forge-ts
The JavaScript data transformation and analysis toolkit inspired by Pandas and LINQ.
skrub-data/skrub
Prepping tables for machine learning
ECNU-ICALK/EduChat
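An open-source educational chat model from ICALK, East China Normal University. An open-source Chinese-English educational dialogue large language model (general-purpose base model, GPU deployment, data cleaning). In tribute to: LLaMA, MOSS, BELLE, Ziya, vLLM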
schema-inspector/schema-inspector
Schema-Inspector is a simple JavaScript object sanitization and validation module.
akanz1/klib
Easy to use Python library of customized functions for cleaning and analyzing data.
encord-team/encord-active
The toolkit to test, validate, and evaluate your models and surface, curate, and prioritize the most valuable data for labeling.
data-cleaning/validate
Professional data validation for the R environment
msamogh/nonechucks
Deal with bad samples in your dataset dynamically, use Transforms as Filters, and more!
jim-schwoebel/voicebook
🗣️ A book and repo to get you started programming voice computing applications in Python (10 chapters and 200+ scripts).
Desbordante/desbordante-core
Desbordante is a high-performance data profiler that can discover many different patterns in data using various algorithms. It can also run data cleaning scenarios using these algorithms. Desbordante has a console version and an easy-to-use web application.
rasgointelligence/feature-engineering-tutorials
Data Science Feature Engineering and Selection Tutorials
probcomp/PClean
A domain-specific probabilistic programming language for scalable Bayesian data cleaning
ajaymache/data-analysis-using-python
Exploratory data analysis 📊 using Python 🐍 of a used car 🚘 database taken from Kaggle
genomoncology/FuzzTypes
Pydantic extension for annotating autocorrecting fields.
BdR76/CSVLint
CSV Lint plug-in for Notepad++ for syntax highlighting, CSV validation, automatic column and datatype detection, fixed-width datasets, changing datetime formats and decimal separators, sorting data, counting unique values, and converting to XML, JSON, SQL, etc. A plugin for data cleaning and working with messy data files.
ekstroem/dataMaid
An R package for data screening
jim-schwoebel/allie
🤖 An automated machine learning framework for audio, text, image, video, or .CSV files (50+ featurizers and 15+ model trainers). Python 3.6 required.
hi-primus/bumblebee
🚕 A spreadsheet-like data preparation web app that works over Optimus (Pandas, Dask, cuDF, Dask-cuDF, Spark and Vaex)
iam-mhaseeb/Skytrax-Data-Warehouse
A full data warehouse infrastructure with ETL pipelines running inside Docker on Apache Airflow for data orchestration, AWS Redshift as the cloud data warehouse, and Metabase for data visualizations such as analytical dashboards.
CambioML/uniflow-llm-based-pdf-extraction-text-cleaning-data-clustering
LLM-based text extraction from unstructured data such as PDFs, Word documents, and HTML pages. Transform and cluster the text into your desired format. Less information loss, more interpretation, and faster R&D!
KulikDM/pythresh
Outlier Detection Thresholding
charlesdedampierre/BunkaTopics
🗺️ Data Cleaning and Textual Data Visualization 🗺️
ChrisMuir/refinr
Cluster and merge similar string values: an R implementation of Open Refine clustering algorithms
Iqrar99/data-analytics-portfolio
Portfolio of data science and data analyst projects I completed for academic, self-learning, and hobby purposes.
aai-institute/pyDVL
pyDVL is a library of stable implementations of algorithms for data valuation and influence function computation
LoLei/redditcleaner
Cleans Reddit Text Data 📜 🧹
HoloClean/HoloClean-Legacy-deprecated
A Machine Learning System for Data Enrichment.
opendataval/opendataval
OpenDataVal: a Unified Benchmark for Data Valuation in Python (NeurIPS 2023)