awesome-data-wrangling


Apache License 2.0

A curated list of data wrangling resources with a bias towards command line tools without steep learning curves.

Excel to CSV conversion

xlsx2csv Command line tool to convert xlsx to csv. Fast and works for large xlsx files. Doesn't handle passwords.

libreoffice Use the GUI or headless mode to convert xlsx to CSV. Doesn't seem to handle passwords or multiple sheets in headless mode.

Apache POI Java APIs for manipulating various file formats based upon the Office Open XML standards. Can be used to extract text from spreadsheets and supports passwords etc., but can be memory intensive.

Excel Streaming Reader Java streaming Excel reader using Apache POI - use for reading in large files without exhausting memory

Excelize Go library that reads and writes XLSX files generated by Office Excel 2007 and later.

Data Search/Filtering

ripgrep Extremely fast grep alternative.

JSON Processing

jq A lightweight and flexible command-line JSON processor

XML Processing

xmlstarlet Command line tools to transform, query, validate, and edit XML documents

CSV Processing

XSV CSV Toolkit Fast, command line toolkit for CSV manipulation and analysis. Written in Rust.

awesomecsv A curated list of resources for dealing with CSV data.

awesome-csv A collection about the comma-separated values (CSV) world for rich structured data in (plain) text

Compression Tools

zstandard Fast, efficient compressor for small data sets (less than 100MB) leveraging pre-trained dictionaries.

shoco A fast, ASCII-biased entropy encoder for short strings using trained bigrams. Trained on English by default, supports training custom models.

smaz Dictionary based compressor for very small strings (less than 100 bytes). Uses an English dictionary by default but can be customized via code.
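The tools above share one idea: a dictionary agreed on ahead of time lets very small payloads compress well. Below is a minimal sketch of that idea using the preset-dictionary support in Python's standard `zlib` module. It is not the zstandard, shoco, or smaz API, and the dictionary contents and the sample record are made up for illustration.

```python
import zlib

# Hypothetical shared dictionary: byte sequences expected to recur in every record.
SHARED_DICT = b'{"name": "", "city": "", "country": "", "email": "@example.com"}'


def compress_with_dict(payload: bytes, zdict: bytes = SHARED_DICT) -> bytes:
    """Compress a small payload, priming the compressor with a preset dictionary."""
    compressor = zlib.compressobj(level=9, zdict=zdict)
    return compressor.compress(payload) + compressor.flush()


def decompress_with_dict(blob: bytes, zdict: bytes = SHARED_DICT) -> bytes:
    """Reverse compress_with_dict; the same dictionary must be supplied."""
    decompressor = zlib.decompressobj(zdict=zdict)
    return decompressor.decompress(blob) + decompressor.flush()


# Made-up sample record that shares its structure with the dictionary.
record = b'{"name": "Ada", "city": "London", "country": "UK", "email": "ada@example.com"}'
plain = zlib.compress(record, 9)
primed = compress_with_dict(record)
print("raw:", len(record), "zlib:", len(plain), "zlib + dictionary:", len(primed))
```

Both sides need the same dictionary, so it has to be versioned and shipped alongside the data.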

PDF Processing

pdftabextract A set of tools for extracting tables from PDF files helping to do data mining on (OCR-processed) scanned documents. For an overview see the blog here

Data Exploration and Sharing

Redash Connect to any data source, easily visualize and share your data via dashboard. Open source and self hostable.

Superset - previously known as caravel Data exploration platform designed to be visual, intuitive, and interactive. From Airbnb, Python based, supports Druid along with other SQL sources.

Metabase Visual analytics and dashboards from a wide range of SQL sources. Java based, SQL and non SQL modes, easy to share.

General Purpose

miller Like awk, sed, cut, join, and sort for name-indexed data such as CSV, TSV, and tabular JSON
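"Name-indexed" means fields are addressed by column name instead of by position. The sketch below illustrates that with Python's standard `csv` module. It does not use miller or reproduce its syntax, and the sample data and the `filter_and_sort` helper are hypothetical.

```python
import csv
import io

# Made-up sample data for illustration.
SAMPLE = """name,city,amount
alice,Paris,30
bob,Berlin,12
carol,Paris,7
"""


def filter_and_sort(csv_text, where_field, equals, sort_field):
    """Keep rows where where_field == equals, then sort numerically by sort_field."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))  # each row is a dict keyed by header
    kept = [row for row in rows if row[where_field] == equals]
    return sorted(kept, key=lambda row: float(row[sort_field]))


result = filter_and_sort(SAMPLE, "city", "Paris", "amount")
for row in result:
    print(row["name"], row["amount"])
```

Because columns are looked up by name, the same code keeps working if the input file reorders its columns.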

Postal Address Processing

libpostal Libpostal is an open source project designed to provide fast, global address expansion and parsing using natural language processing techniques. Unlike several open source address parsing systems which rely on explicit patterns and complex regular expressions, libpostal relies on a trained model derived from a corpus of global place names, national address patterns, and different languages' terminology. It was created by Al Barrentine, initially for the Mapzen OpenVenues project. See intro post for details.

pelias.io A modular, open-source geocoder built on top of ElasticSearch for fast geocoding. Works with OpenStreetMap, OpenAddresses, Geonames, and Who's on First and can leverage libpostal for address parsing and expansion.

Datasets

Global Chain Store Names List 131k chain store names from OpenVenues

Test Data Generation

Generate fake data for testing and demos

phoney Command line program that accepts a template and outputs fake data. Go based.

faker.js and faker cli Generate fake data from a browser, cli or REST call. Node.js based and includes avatars. See demo

faker Python based module and command line tool for generating fake data

elizabeth Python module for generating fake data profiles. Claims to be faster than alternatives, simpler and more self contained.

Mockaroo Web app for generating realistic test data. Free up to 1000 records.

generatedata.com Web app for generating test data. Generate up to 100 records for free, 5000 for $20, or download from http://benkeen.github.io/generatedata/

dsgen-big Dataset generator for producing dirty data with duplicates, typos etc. Based on the original Febrl dbgen code.

ranger An open source fake data generator.

kolpa An open source fake data generator in Go.
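Several of the generators above take a template and fill it with fake values. Below is a minimal sketch of that approach using only the Python standard library. The `{{placeholder}}` syntax, the word lists, and the `render` helper are assumptions made for illustration and do not match the template format of phoney, faker, or any other tool listed here.

```python
import random
import re

# Hypothetical value pools; real tools ship much larger, locale-aware data sets.
FIRST_NAMES = ["Ada", "Grace", "Linus", "Margaret"]
CITIES = ["Lisbon", "Nairobi", "Osaka", "Quito"]

# Each placeholder name maps to a function that draws a value from the given RNG.
GENERATORS = {
    "name": lambda rng: rng.choice(FIRST_NAMES),
    "city": lambda rng: rng.choice(CITIES),
    "age": lambda rng: str(rng.randint(18, 90)),
}

PLACEHOLDER = re.compile(r"\{\{(\w+)\}\}")


def render(template, rng):
    """Replace every {{placeholder}} in template with a generated value."""
    def fill(match):
        # An unknown placeholder name raises KeyError rather than passing silently.
        return GENERATORS[match.group(1)](rng)
    return PLACEHOLDER.sub(fill, template)


rng = random.Random(42)  # fixed seed so repeated runs produce the same rows
lines = [render("{{name}},{{city}},{{age}}", rng) for _ in range(3)]
print("\n".join(lines))
```

Seeding the generator makes the output reproducible, which helps when fake data feeds automated tests.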

Compression Tools

Use parallel versions of gzip, bzip2, etc. where possible. They make a large difference in compression throughput, especially on modern servers.

lbzip2 Parallel bzip2 compression utility

pigz A parallel implementation of gzip for modern multi-processor, multi-core machines.

xz General-purpose data compression software with a high compression ratio and parallel support.
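Parallel compression pays off because the input can be split into blocks that are compressed independently on separate cores. The sketch below compresses each block as its own gzip member and concatenates the results, which the gzip format permits. It is an illustration in standard-library Python, not a description of how pigz, lbzip2, or xz are implemented, and the chunk size and worker count are arbitrary assumptions.

```python
import gzip
from concurrent.futures import ThreadPoolExecutor

CHUNK_SIZE = 128 * 1024  # assumed block size; tune for the workload


def parallel_gzip(data: bytes, workers: int = 4, chunk_size: int = CHUNK_SIZE) -> bytes:
    """Compress data as a sequence of independent gzip members, one per chunk."""
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)] or [b""]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves input order, so the members line up with the chunks.
        members = list(pool.map(gzip.compress, chunks))
    return b"".join(members)


payload = b"timestamp,value\n" * 100_000  # made-up, highly repetitive sample data
archive = parallel_gzip(payload)
print(len(payload), "->", len(archive), "bytes")
```

Standard tools read the concatenated members back as one stream, so `gzip.decompress(archive)` returns the original bytes. Compressing blocks independently usually costs a little compression ratio compared with a single stream.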

Fuzzy Matching

talisman JavaScript NLP library that includes a large selection of phonetic fingerprints, fuzzy matching keys and distance metrics.
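To make "fuzzy matching keys" and "distance metrics" concrete, here is a minimal Python sketch of one of each: a token-based fingerprint key and the Levenshtein edit distance. It does not use talisman, and the function names are chosen for illustration only.

```python
import re
import unicodedata


def fingerprint(text: str) -> str:
    """Build a matching key: strip accents, lowercase, then sort the unique tokens."""
    decomposed = unicodedata.normalize("NFKD", text)
    ascii_text = decomposed.encode("ascii", "ignore").decode("ascii")
    tokens = re.findall(r"[a-z0-9]+", ascii_text.lower())
    return " ".join(sorted(set(tokens)))


def levenshtein(a: str, b: str) -> int:
    """Count the single-character insertions, deletions and substitutions from a to b."""
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]


print(fingerprint("Café  Noir, The") == fingerprint("the cafe noir"))
print(levenshtein("kitten", "sitting"))
```

Keys group records that differ only in case, punctuation, accents or word order. Distances rank near misses such as typos.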

Data Prep Tools

visidata A curses interface for exploring and arranging tabular data.

Process Orchestration

conductor Conductor is an orchestration engine from Netflix. Workflows are defined using a JSON based DSL and are made up of tasks that are either control tasks (fork, conditional etc.) or application tasks (e.g. encode a file) that are executed on a remote machine. Process tasks are executed by remote workers (in any language) that poll the workflow state. The core is Java, but workers are just HTTP clients.

camunda An open source Business Process and Decision Automation platform with support for BPMN and modelling. Process tasks are executed by task clients that are executed by the process engine. Java focused.

apache airflow Airflow is a platform to programmatically author, schedule and monitor workflows. Workflows are directed acyclic graphs (DAGs) of tasks that are executed, by the airflow scheduler, on an array of workers while following the specified dependencies. Python focused.
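The DAG model described for Airflow means a task runs only after every task it depends on has finished. A minimal sketch of that scheduling rule follows, using `graphlib` from the Python standard library (Python 3.9+). It does not use the Airflow API, and the task names and the `run_pipeline` helper are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each task maps to the set of tasks it depends on.
DEPENDENCIES = {
    "extract": set(),
    "clean": {"extract"},
    "geocode": {"clean"},
    "dedupe": {"clean"},
    "load": {"geocode", "dedupe"},
}


def run_pipeline(dependencies, run_task):
    """Run every task once, never before its dependencies, and return the order used."""
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError if the graph is not acyclic
    order = []
    while sorter.is_active():
        # Everything returned here has all dependencies done; a real scheduler
        # could hand these tasks to separate workers at the same time.
        for task in sorted(sorter.get_ready()):
            run_task(task)
            order.append(task)
            sorter.done(task)
    return order


executed = run_pipeline(DEPENDENCIES, lambda task: print("running", task))
```

Here the tasks run sequentially in one process. The part this sketch leaves out, distributing ready tasks across workers, is what an orchestration engine adds on top of the ordering.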