Data Engineering Workshops on some of the more popular libraries, frameworks and tech circa 2023-2024.
For Data Engineers working with Python, Rust and Julia :P
This set of notebooks works through examples of how some pretty sophisticated data engineering can be done using Python Collections, Itertools and Functools. It uses the small MovieLens dataset.
- Basic Collections and the Collections Module: Notebook also
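To give a flavour of what these notebooks cover, here is a minimal sketch that uses only the standard library. The rows below are made up for illustration (hypothetical titles and ratings); they only imitate the shape of the MovieLens `movies.csv` and `ratings.csv` files, and the function names are my own, not taken from the notebooks.

```python
from collections import Counter, defaultdict
from functools import reduce
from itertools import groupby
from operator import itemgetter

# Hypothetical rows shaped like movies.csv: (movieId, title, genres)
MOVIES = [
    (1, "Movie A", "Adventure|Comedy"),
    (2, "Movie B", "Comedy|Romance"),
    (3, "Movie C", "Drama"),
]

# Hypothetical rows shaped like ratings.csv: (userId, movieId, rating)
RATINGS = [
    (1, 1, 4.0),
    (1, 2, 3.0),
    (2, 1, 5.0),
    (2, 3, 2.0),
    (3, 2, 4.0),
]


def genre_counts(movies):
    """Count how many movies carry each genre tag (collections.Counter)."""
    return Counter(g for _, _, genres in movies for g in genres.split("|"))


def mean_rating_by_movie(ratings):
    """Average rating per movieId (itertools.groupby needs sorted input)."""
    by_movie = sorted(ratings, key=itemgetter(1))
    result = {}
    for movie_id, rows in groupby(by_movie, key=itemgetter(1)):
        scores = [rating for _, _, rating in rows]
        result[movie_id] = sum(scores) / len(scores)
    return result


def ratings_per_user(ratings):
    """Fold rows into {userId: [ratings]} (functools.reduce + defaultdict)."""
    def step(acc, row):
        acc[row[0]].append(row[2])
        return acc
    return dict(reduce(step, ratings, defaultdict(list)))
```

Note that `groupby` only groups adjacent rows, which is why the ratings are sorted by `movieId` first.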
01 Numpy
- NumPy vs Python Collections Notebook also
02 Pandas
- Wrangling MovieLens with Pandas - Part 1: Getting Started, Load the MovieLens dataset: Notebook also
- Wrangling MovieLens with Pandas - Part 2: Playing with the Movies and Ratings data: Notebook also
03 Spark
- 01 - Setting up Spark locally (on Windows): Notebook also
- 02 - How to run Apache Spark based notebooks in Google Colab: Notebook also
- Part 01: Overview, Starting Spark and Loading the data: Notebook or
- Part 02: Data Analysis basics using tags.csv from the MovieLens dataset: Notebook or
04 Dask
- Distributed Data Analysis with Dask - Part 1: Getting Started, Load the MovieLens dataset: Notebook also
- Distributed Data Analysis with Dask - Part 2: Playing with the Movies data: Notebook also
05 Polars
- Polars with the MovieLens dataset - Getting Started, Load the MovieLens dataset, A quick look at Arrow, and some analysis: Notebook also
06 Arrow, Arrow DataFusion and Ballista
- 01 - 10+ minutes to Arrow+DataFusion+Ballista [WIP]: Notebook also
07 Ray
- [WIP]
- [WIP]
The "10+ minutes to XX" notebooks are references only, not workshop material to be run. They carry the toy examples found on the "getting started" pages for XX. I have tried to ensure there's a 10+ minutes notebook for each data engineering library/framework considered here. While it may be interesting to go through these to quickly refresh the syntax and other idiosyncrasies, the actual data munging happens in the other notebooks.
01 Numpy
- NumPy User Guide (v1.23 as of this writing)
- Numpy Tutorials
- NumPy Basics: Arrays and Vectorized Computation from Wes McKinney's Python for Data Analysis, 3E:
- Numpy is absurd
- 100 Numpy Exercises
- From Python to Numpy
02 Pandas
- Pandas (current stable version) User Guide
- 10 minutes to pandas
- Data Cleaning and Preparation from Wes McKinney's Python for Data Analysis, 3E:
- Data Wrangling: Join, Combine, and Reshape from Wes McKinney's Python for Data Analysis, 3E:
- Data Aggregation and Group Operations from Wes McKinney's Python for Data Analysis, 3E:
- Effective Pandas | Matt Harrison, also here
- ...also from Matt Harrison on GitHub: effective pandas (book) and idiomatic pandas tutorial
- Pandas Exercises
- 100 Pandas Puzzles
03 Spark
- Spark User Guide
- The Internals of Apache Spark online book
- PySpark User Guide
- This is also available as live binder notebooks:
- Spark SQL and Built-in Functions Reference
- weak references, some dated but interesting
- PySpark Cheatsheet
- The "Data Savvy" YouTube Channel
04 Dask
Dask takes a different approach from Spark: it focuses on task scheduling, whereas Spark follows a Map-Reduce model.
- 10 minutes to Dask
- 90-minute Dask tutorial video
- Talks and tutorials page
- The Dask tutorial notebooks
- The SciPy 2022 tutorial talk
- Journey of a Task
- High level performance of Pandas, Dask, Spark, and Arrow - from Dask Working Notes Blog
- Dask distributed
- Dask Task Graphs
- Tornado - used by Dask distributed
- For some Dask exercises, we may need GraphViz or Cytoscape and ipycytoscape
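The task-scheduling idea mentioned at the top of this section can be sketched without installing anything. Dask documents its graphs as plain dicts whose values are either data or tuples of the form `(callable, *args)`, where an argument may name another key. The evaluator below is a toy of my own in plain Python, not Dask's scheduler: it resolves dependencies recursively and runs each task at most once, with no parallelism. The names `get`, `_is_key` and `GRAPH` are illustrative only.

```python
from operator import add, mul


def _is_key(graph, arg):
    """True if `arg` names another task in the graph (unhashable args never do)."""
    try:
        return arg in graph
    except TypeError:
        return False


def get(graph, key, cache=None):
    """Recursively evaluate `key`, computing each task at most once."""
    if cache is None:
        cache = {}
    if key in cache:
        return cache[key]
    task = graph[key]
    if isinstance(task, tuple) and task and callable(task[0]):
        func, *args = task
        # Arguments that are keys are evaluated first; anything else is a literal.
        resolved = [get(graph, a, cache) if _is_key(graph, a) else a for a in args]
        value = func(*resolved)
    else:
        value = task  # plain data, nothing to run
    cache[key] = value
    return value


# A small graph: z = x + y, w = z * 10
GRAPH = {
    "x": 1,
    "y": 2,
    "z": (add, "x", "y"),
    "w": (mul, "z", 10),
}
```

Calling `get(GRAPH, "w")` walks `w -> z -> x, y` before running `mul`. A real scheduler would work from the same dependency information, but would also decide ordering, placement and parallelism.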
05 Polars
06 Arrow, Arrow DataFusion and Ballista
- Apache Arrow Official Native Rust Implementation
- pyArrow
- Apache Arrow Python Cookbook
- DataFusion User Guide
- Arrow DataFusion Python
- DataFusion Roadmap Epics
- Ballista on GitHub
- Arrow NumPy Integration
- Arrow Pandas Integration
07 Ray
Datasets we use:
- MovieLens 25M Dataset
- Wikipedia Movie Plots
- CMU Movie Summary Corpus also here
- MoviePlotEvents (CMU Movie Summary Corpus with Events) also here
- Netflix Prize Dataset
- Netflix data with 26+ joined attributes
There are a lot of interesting (interesting to me) tools, datasets and papers out there.
When there's time or need, we'll get to them as well.
- Arrow and pyArrow really warrant a deeper study, maybe as a gateway to Rust-based data processing. They're not really emerging anymore: a lot of very cool stuff is being done with Arrow and DataFusion, and it's very interesting to explore.
- Apache Arrow Ballista is looking very interesting from a next gen distributed processing PoV
- PRQL, on GitHub and PRQL Query. Also the PRQL Book.
- Mars and Project Mars on GitHub
- Modin
- Polars. Also, Polars GitHub Repo
- DuckDB, GitHub
- FoundationDB, GitHub
- Danfo.js - pandas like dataframes in JavaScript
- Velox also GitHub and Gluten, also GitHub
- I think there's something to be said about leveraging TPC benchmarks - we'll attend to this in due time. There's got to be a .md readme in this repo that'll list all the queries anyway. Yea, lemme do that soonish.
- Is there value in comparing formats? [Parquet](https://parquet.apache.org/docs/), [Zarr](https://zarr.readthedocs.io/en/stable/tutorial.html) etc.?
- Papers and Data - Scifi TV Shows (Scifi TV Show Plot Summaries & Events)
- Papers and Data - Story Cloze
- State Of The Art on paperswithcode
- Only 'cause LLMs have been trending for a while: A Survey of Large Language Models
- SST (Stanford Sentiment Treebank), also
- ...
Kitchen sink of all other references I've found useful (or wonderful). There's so much to learn I tell you!
- How Query Engines Work
- Carnegie Mellon's Advanced Database Systems Playlist:
- Go here if the Advanced Database Systems course feels hard - CMU Intro to Database Systems (15-445/645 - Fall 2022), also course site
- Database Query Optimizers
- ¡Databases! – A Database Seminar Series (Fall 2022), also on CMU
- Hardware Accelerated Database Lectures (Fall 2018)
- Time Series Database Lectures (Fall 2017)
- The Databaseology Lectures (Fall 2015)
- Seven Databases in Seven Weeks (Fall 2014)
- This explanation for List Comprehensions