data-cleaning
There are 4484 repositories under the data-cleaning topic.
cleanlab/cleanlab
Cleanlab's open-source library is the standard data-centric AI package for data quality and machine learning with messy, real-world data and labels.
voxel51/fiftyone
Refine high-quality datasets and visual AI models
johnkerl/miller
Miller is like awk, sed, cut, join, and sort for name-indexed data such as CSV, TSV, and tabular JSON
unionai-oss/pandera
A light-weight, flexible, and expressive statistical data testing library
justmarkham/pandas-videos
Jupyter notebook and datasets from the pandas video series
justmarkham/DAT8
General Assembly's 2015 Data Science course in Washington, DC
hi-primus/optimus
🚚 Agile Data Preparation Workflows made easy with Pandas, Dask, cuDF, Dask-cuDF, Vaex and PySpark
skrub-data/skrub
Machine learning with dataframes
sfirke/janitor
simple tools for data cleaning in R
data-forge/data-forge-ts
The JavaScript data transformation and analysis toolkit inspired by Pandas and LINQ.
OpenDCAI/DataFlow
Easy Data Preparation with latest LLMs-based Operators and Pipelines.
ECNU-ICALK/EduChat
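An open-source educational chat model from ICALK, East China Normal University. An open-source Chinese-English educational dialogue large language model (general-purpose base model, GPU deployment, data cleaning). With thanks to: LLaMA, MOSS, BELLE, Ziya, vLLM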
akanz1/klib
Easy to use Python library of customized functions for cleaning and analyzing data.
schema-inspector/schema-inspector
Schema-Inspector is a simple JavaScript object sanitization and validation module.
encord-team/encord-active
The toolkit to test, validate, and evaluate your models and surface, curate, and prioritize the most valuable data for labeling.
data-cleaning/validate
Professional data validation for the R environment
Desbordante/desbordante-core
Desbordante is a high-performance data profiler that can discover many different patterns in data using various algorithms. It also lets you run data cleaning scenarios using these algorithms. Desbordante has a console version and an easy-to-use web application.
jim-schwoebel/voicebook
🗣️ A book and repo to get you started programming voice computing applications in Python (10 chapters and 200+ scripts).
msamogh/nonechucks
Deal with bad samples in your dataset dynamically, use Transforms as Filters, and more!
DataWithBaraa/sql-data-warehouse-project
A comprehensive guide to building a modern data warehouse with SQL Server, including ETL processes, data modeling, and analytics.
rasgointelligence/feature-engineering-tutorials
Data Science Feature Engineering and Selection Tutorials
ajaymache/data-analysis-using-python
Exploratory data analysis 📊 using Python 🐍 of a used car 🚘 database taken from Kaggle
probcomp/PClean
A domain-specific probabilistic programming language for scalable Bayesian data cleaning
CambioML/uniflow-llm-based-pdf-extraction-text-cleaning-data-clustering
LLM-based text extraction from unstructured data such as PDF, Word, and HTML files. Transform and cluster the text into your desired format. Less information loss, more interpretation, and faster R&D!
genomoncology/FuzzTypes
Pydantic extension for annotating autocorrecting fields.
BdR76/CSVLint
CSV Lint plug-in for Notepad++ for syntax highlighting, CSV validation, automatic column and datatype detection, fixed-width datasets, changing datetime format and decimal separator, sorting data, counting unique values, converting to XML, JSON, SQL, etc. A plugin for data cleaning and working with messy data files.
charlesdedampierre/BunkaTopics
🗺️ Data Cleaning and Textual Data Visualization 🗺️
jim-schwoebel/allie
🤖 An automated machine learning framework for audio, text, image, video, or .CSV files (50+ featurizers and 15+ model trainers). Python 3.6 required.
ekstroem/dataMaid
An R package for data screening
hi-primus/bumblebee
🚕 A spreadsheet-like data preparation web app that works over Optimus (Pandas, Dask, cuDF, Dask-cuDF, Spark and Vaex)
Hi-Dolphin/datamax
A powerful multi-format file parsing, data cleaning, and AI annotation toolkit.
iam-mhaseeb/Skytrax-Data-Warehouse
A full data warehouse infrastructure with ETL pipelines running inside docker on Apache Airflow for data orchestration, AWS Redshift for cloud data warehouse and Metabase to serve the needs of data visualizations such as analytical dashboards.
aai-institute/pyDVL
pyDVL is a library of stable implementations of algorithms for data valuation and influence function computation
KulikDM/pythresh
Outlier Detection Thresholding
xShaimaa/Data-Analysis-Projects
Data analysis practice projects, including cleaning, visualization, and EDA on different datasets using Python, SQL, Power BI, etc.
Iqrar99/data-analytics-portfolio
Portfolio of data science and data analysis projects I completed for academic, self-learning, and hobby purposes.