/learn-data-munging

Notes on Data Engineering with Pandas, PySpark, Dask, Ray, Arrow DataFusion, Polars etc.

Primary language: Jupyter Notebook. License: MIT.

Data Munging Using *X* in Python, Rust & Julia

Data Engineering Workshops on some of the more popular libraries, frameworks and tech circa 2023-2024.

Data Wrangling with Python, Rust and Julia. Image © Shaurya Agarwal, created using DALL·E and GIMP

Data Engineers working with Python, Rust and Julia :P


Notebooks

00 Python Collections

This set of notebooks works through examples of how some pretty sophisticated data engineering can be done using Python Collections, Itertools and Functools. It uses the small MovieLens dataset.

  • Basic Collections and the Collections Module: Notebook also Open In Colab
  • NumPy vs Python Collections: Notebook also Open In Colab
  • Wrangling MovieLens with Pandas - Part 1: Getting Started, Load the MovieLens dataset: Notebook also Open In Colab
  • Wrangling MovieLens with Pandas - Part 2: Playing with the Movies and Ratings data: Notebook also Open In Colab
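
As a rough taste of the collections-based style described at the top of this section, here is a minimal sketch that uses only the standard library. The rows are made up and merely laid out like the MovieLens `movies.csv` and `ratings.csv` files; the names `Movie`, `Rating`, `load_movies`, `load_ratings` and `mean` are hypothetical helpers for this illustration and are not taken from the notebooks.

```python
import csv
import io
from collections import Counter, defaultdict, namedtuple
from functools import reduce
from itertools import groupby
from operator import attrgetter

# Made-up rows laid out like MovieLens movies.csv / ratings.csv (not real data).
MOVIES_CSV = """movieId,title,genres
1,Movie A,Comedy|Drama
2,Movie B,Drama
3,Movie C,Comedy
"""
RATINGS_CSV = """userId,movieId,rating,timestamp
10,1,4.0,1000
10,2,3.0,1001
11,1,5.0,1002
11,3,2.0,1003
12,2,4.0,1004
"""

Movie = namedtuple("Movie", "movie_id title genres")
Rating = namedtuple("Rating", "user_id movie_id rating")


def load_movies(text):
    """Parse CSV text into Movie tuples, splitting the pipe-separated genres."""
    rows = csv.DictReader(io.StringIO(text))
    return [Movie(int(r["movieId"]), r["title"], r["genres"].split("|")) for r in rows]


def load_ratings(text):
    """Parse CSV text into Rating tuples (the timestamp column is ignored)."""
    rows = csv.DictReader(io.StringIO(text))
    return [Rating(int(r["userId"]), int(r["movieId"]), float(r["rating"])) for r in rows]


def mean(values):
    values = list(values)
    return sum(values) / len(values)


movies = load_movies(MOVIES_CSV)
ratings = load_ratings(RATINGS_CSV)

# Counter: how many movies carry each genre tag.
genre_counts = Counter(g for m in movies for g in m.genres)

# groupby only groups adjacent items, so sort by the key first.
by_movie = sorted(ratings, key=attrgetter("movie_id"))
mean_rating = {
    movie_id: mean(r.rating for r in group)
    for movie_id, group in groupby(by_movie, key=attrgetter("movie_id"))
}

# defaultdict: a hand-rolled "join" of ratings to titles, grouped by user.
titles = {m.movie_id: m.title for m in movies}
seen_by_user = defaultdict(list)
for r in ratings:
    seen_by_user[r.user_id].append(titles[r.movie_id])

# reduce: fold all ratings into a single (count, total) pair.
count, total = reduce(lambda acc, r: (acc[0] + 1, acc[1] + r.rating), ratings, (0, 0.0))
```

Swapping the inline strings for `open(...)` on the real CSV files should be enough to try the same steps on the small MovieLens dataset, assuming the files keep the column names shown here.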

01 - Toy introduction to the basics

  • 01 - Setting up Spark locally (on Windows): Notebook also Open In Colab

  • 02 - How to run Apache Spark based notebooks in Google Colab: Notebook also Open In Colab

02 - A set of notebooks exploring data wrangling in depth using the MovieLens dataset

  • Part 01: Overview, Starting Spark and Loading the data: Notebook or Open In Colab

  • Part 02: Data Analysis basics using tags.csv from the MovieLens dataset: Notebook or Open In Colab

04 Dask

  • Distributed Data Analysis with Dask - Part 1: Getting Started, Load the MovieLens dataset: Notebook also Open In Colab
  • Distributed Data Analysis with Dask - Part 2: Playing with the Movies data: Notebook also Open In Colab
  • Polars with the MovieLens dataset - Getting Started, Load the MovieLens dataset, A quick look at Arrow, and some analysis: Notebook also Open In Colab
  • 01 - 10+ minutes to Arrow+DataFusion+Ballista [WIP]: Notebook also Open In Colab

07 Ray

  • [WIP]

99 Static: The TPC Benchmark Queries

  • [WIP]

Note

The "10+ minutes to XX" notebooks are references only and are not meant to be run as workshop material. They collect the toy examples that the "getting started" pages for XX carry. I have tried to ensure there is a 10+ minutes notebook for each data engineering library or framework covered here. They are handy for quickly refreshing the syntax and other idiosyncrasies, but the actual data munging happens in the other notebooks.

References

04 Dask

The approach is different: Dask is built on generic task scheduling, whereas Spark is fundamentally an extension of the Map-Shuffle-Reduce paradigm.
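
To make that contrast a little more concrete, here is a toy sketch in plain Python. It uses neither the Dask nor the Spark API, and it is not how either scheduler is implemented; the `tasks` dictionary layout and the `run_graph` helper are hypothetical, invented only for this illustration. The first half applies one uniform map to every partition and then merges the results; the second half runs named tasks with arbitrary dependencies in dependency order.

```python
from collections import Counter
from functools import reduce
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Toy input: each inner list stands in for one partition of tag data (made up).
partitions = [["funny", "dark"], ["funny"], ["dark", "classic", "funny"]]

# --- Map-reduce style: the same map on every partition, then one merge. ---
mapped = map(Counter, partitions)
tag_counts = reduce(lambda a, b: a + b, mapped, Counter())

# --- Task-graph style: named tasks with explicit, arbitrary dependencies. ---
# Each entry is task_name -> (function, names of the tasks it depends on).
tasks = {
    "load": (lambda: partitions, []),
    "flatten": (lambda parts: [t for p in parts for t in p], ["load"]),
    "count": (lambda tags: Counter(tags), ["flatten"]),
    "distinct": (lambda tags: sorted(set(tags)), ["flatten"]),
    "report": (lambda counts, names: {n: counts[n] for n in names}, ["count", "distinct"]),
}


def run_graph(tasks):
    """Run every task once, in an order that respects its dependencies."""
    order = TopologicalSorter({name: deps for name, (_, deps) in tasks.items()})
    results = {}
    for name in order.static_order():
        func, deps = tasks[name]
        # Pass the results of the dependencies as positional arguments.
        results[name] = func(*(results[d] for d in deps))
    return results


results = run_graph(tasks)
```

In this sketch `count` and `distinct` both depend only on `flatten`, so a real scheduler would be free to run them in parallel; the toy runner simply executes them one after another.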

07 Ray

Future State / Miscellany

Datasets we use:

There are a lot of interesting (interesting to me) tools, datasets and papers out there.
When there's time or need, we'll get to them as well.

MOAR GIMME MOAR LINKS!!!

Kitchen sink of all other references I've found useful (or wonderful). There's so much to learn I tell you!
