shauryashaurya/learn-data-munging

Notes on Data Engineering with Pandas, PySpark, Dask, Ray, Arrow DataFusion, Polars etc.

Jupyter Notebook · MIT

Data Munging Using X in Python, Rust & Julia

Data Engineering Workshops on some of the more popular libraries, frameworks and tech circa 2023-2024.

For data engineers working with Python, Rust and Julia :P

Notebooks

00 Python Collections

This set of notebooks works through examples of how some pretty sophisticated data engineering can be done using Python Collections, Itertools and Functools. It uses the small MovieLens dataset.

Basic Collections and the Collections Module: Notebook also
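
As a rough flavor of what these notebooks cover, here is a minimal sketch that uses only the standard library. The rows are made-up stand-ins shaped like MovieLens ratings (userId, movieId, rating); they are not taken from the dataset, and all variable names are illustrative assumptions rather than anything from the notebooks.

```python
from collections import Counter, defaultdict
from functools import reduce
from itertools import groupby
from operator import itemgetter

# Hypothetical rows shaped like MovieLens ratings: (userId, movieId, rating).
ratings = [
    (1, 10, 4.0),
    (1, 20, 3.0),
    (2, 10, 5.0),
    (2, 30, 2.0),
    (3, 10, 3.0),
]

# Counter: how many ratings did each movie receive?
ratings_per_movie = Counter(movie_id for _, movie_id, _ in ratings)

# defaultdict: collect every rating value under its movie.
ratings_by_movie = defaultdict(list)
for _, movie_id, rating in ratings:
    ratings_by_movie[movie_id].append(rating)

# Dict comprehension: mean rating per movie.
mean_rating = {m: sum(rs) / len(rs) for m, rs in ratings_by_movie.items()}

# itertools.groupby only groups adjacent items, so sort by the key first.
by_user = sorted(ratings, key=itemgetter(0))
movies_per_user = {
    user_id: [movie_id for _, movie_id, _ in rows]
    for user_id, rows in groupby(by_user, key=itemgetter(0))
}

# functools.reduce: fold all rating values into a single total.
total = reduce(lambda acc, row: acc + row[2], ratings, 0.0)

print(ratings_per_movie.most_common(1))
print(mean_rating)
print(movies_per_user)
print(total)
```

The same group-then-aggregate pattern is what a dataframe `groupby` does in the later sections, only written by hand here.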

01 Numpy

NumPy vs Python Collections Notebook also

02 Pandas

Wrangling MovieLens with Pandas - Part 1: Getting Started, Load the MovieLens dataset: Notebook also
Wrangling MovieLens with Pandas - Part 2: Playing with the Movies and Ratings data: Notebook also

03 Spark

01 - Toy introduction to the basics

01 - Setting up Spark locally (on Windows): Notebook also
02 - How to run Apache Spark based notebooks in Google Colab: Notebook also

02 - A set of notebooks exploring data wrangling in depth using the MovieLens dataset

Part 01: Overview, Starting Spark and Loading the data: Notebook or
Part 02: Data Analysis basics using tags.csv from the MovieLens dataset: Notebook or

04 Dask

Distributed Data Analysis with Dask - Part 1: Getting Started, Load the MovieLens dataset: Notebook also
Distributed Data Analysis with Dask - Part 2: Playing with the Movies data: Notebook also

05 Polars

Polars with the MovieLens dataset - Getting Started, Load the MovieLens dataset, A quick look at Arrow, and some analysis: Notebook also

06 Apache Arrow and DataFusion

01 - 10+ minutes to Arrow+DataFusion+Ballista [WIP]: Notebook also

07 Ray

[WIP]

99 Static: The TPC Benchmark Queries

[WIP]

Note

The "10+ minutes to XX" notebooks are references only and are not meant to be run as workshop material. They carry the toy examples found on the "getting started" pages for XX. I have tried to ensure there's a 10+ minutes notebook for each data engineering library/framework considered here. While it may be interesting to go through these to quickly refresh the syntax and other idiosyncrasies, the actual data munging happens in the other notebooks.

References

01 Numpy

02 Pandas

03 Spark

Spark User Guide
The Internals of Apache Spark online book
PySpark User Guide
- This is also available as live binder notebooks:
  - Live Notebook: DataFrame
  - Live Notebook: Pandas API on Spark
Spark SQL and Built-in Functions Reference
weak references, some dated but interesting
PySpark Cheatsheet
The "Data Savvy" YouTube Channel

04 Dask

The approach is different: Dask focuses on Task scheduling vs Spark's Map-Reduce
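
To make the contrast a little more concrete, here is a toy sketch in plain Python (neither Dask nor Spark is needed to run it). The first half computes a sum of squares in a map-then-reduce shape; the second half expresses the same computation as a graph of named tasks and walks it with a tiny recursive scheduler. The dictionary layout loosely imitates the `key -> (function, *args)` convention Dask uses for task graphs, but `run_task` is a hypothetical helper written for this example. It is not Dask's scheduler and does none of the parallelism or optimization a real one does.

```python
from functools import reduce
from operator import add


def square(x):
    return x * x


data = [1, 2, 3, 4]

# Map-reduce shape: apply one function to every element, then fold the results.
map_reduce_result = reduce(add, map(square, data), 0)

# Task-graph shape: each key names either a literal value or a task tuple
# of the form (function, *argument_keys). A task may depend on any other task.
graph = {
    "a": 1,
    "b": 2,
    "c": 3,
    "d": 4,
    "sq_a": (square, "a"),
    "sq_b": (square, "b"),
    "sq_c": (square, "c"),
    "sq_d": (square, "d"),
    "left": (add, "sq_a", "sq_b"),
    "right": (add, "sq_c", "sq_d"),
    "total": (add, "left", "right"),
}


def run_task(graph, key, cache=None):
    """Hypothetical toy scheduler: resolve dependencies depth-first, memoizing results."""
    if cache is None:
        cache = {}
    if key in cache:
        return cache[key]
    node = graph[key]
    if isinstance(node, tuple) and callable(node[0]):
        func, *dep_keys = node
        args = [run_task(graph, dep, cache) for dep in dep_keys]
        result = func(*args)
    else:
        result = node  # a literal value
    cache[key] = result
    return result


task_graph_result = run_task(graph, "total")
print(map_reduce_result, task_graph_result)
```

Only the tasks that `total` depends on are executed, and the `left` and `right` branches share no dependencies, which is the property a real scheduler can exploit to run them in any order or in parallel.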

10 minutes to Dask
90-minute Dask tutorial video
Talks and tutorials page
The Dask tutorial notebooks
The SciPy 2022 tutorial talk
Journey of a Task
High level performance of Pandas, Dask, Spark, and Arrow - from Dask Working Notes Blog
Dask distributed
Dask Task Graphs
Tornado - used by Dask distributed
For some Dask exercises, we may need GraphViz or Cytoscape and ipycytoscape

05 Polars

Polars User Guide and Getting Started

06 Arrow, Arrow DataFusion and Ballista

07 Ray

Future State / Miscellany

Datasets we use:

There are a lot of interesting (interesting to me) tools, datasets and papers out there.
When there's time or need, we'll get to them as well.

Arrow and PyArrow really warrant a deeper study, maybe as a gateway to Rust-based data processing. They are not really emerging anymore: a lot of very cool stuff is being done with Arrow and DataFusion, and it is very interesting to explore.
Apache Arrow Ballista looks very interesting from a next-gen distributed processing point of view.
PRQL, on GitHub and PRQL Query. Also the PRQL Book.
Mars and Project Mars on GitHub
Modin
Polars. Also, Polars GitHub Repo
DuckDB, GitHub
FoundationDB, GitHub
Danfo.js - pandas like dataframes in JavaScript
Velox also GitHub and Gluten, also GitHub
I think there's something to be said about leveraging TPC benchmarks - we'll attend to this in due time. There's got to be a .md readme in this repo that'll list all the queries anyway. Yea, lemme do that soonish.
Is there value in comparing formats? Parquet (https://parquet.apache.org/docs/), Zarr (https://zarr.readthedocs.io/en/stable/tutorial.html) etc.?
Papers and Data - Scifi TV Shows (Scifi TV Show Plot Summaries & Events)
Papers and Data - Story Cloze
State Of The Art on paperswithcode
Only cause LLMs have been trending for a while - A Survey of Large Language Models
SST (Stanford Sentiment Treebank), also
...

MOAR GIMME MOAR LINKS!!!

Kitchen sink of all other references I've found useful (or wonderful). There's so much to learn I tell you!

How Query Engines Work
Carnegie Mellon's Advanced Database Systems Playlist:
Go here if the Advanced Database Systems course feels hard - CMU Intro to Database Systems (15-445/645 - Fall 2022), also course site
Database Query Optimizers
¡Databases! – A Database Seminar Series (Fall 2022), also on CMU
Hardware Accelerated Database Lectures (Fall 2018)
Time Series Database Lectures (Fall 2017)
The Databaseology Lectures (Fall 2015)
Seven Databases in Seven Weeks (Fall 2014)
This explanation for List Comprehensions
