Ai+: Hands-On Parallel Computing with Dask and Pandas

Working on a single computer limits how much data you can process and how quickly you can process it. Many real-world datasets are larger than one machine can comfortably handle, so learning a parallel computing framework is increasingly necessary to stay productive. In this session, we will take a hands-on tour of the major components of Dask's graph computing engine and see how to leverage existing Pandas code to build scalable workflows.

If you use Python for data analysis in any capacity, want to work with datasets bigger than your computer can handle, and have been unsure where to start, this session is for you.

Most of the session will be spent writing code hands-on in Jupyter notebooks. Please review the environment setup guide ahead of time, and check back before the session for updated notebook commits!

Learning Objectives

Part 0: Introductions, Configuration, and Review

  • Configure our Python environments
  • Briefly review the fundamentals of Pandas

Part 1: Intro to Parallelism

  • Explain the core concepts of parallel computing
  • Become familiar with the types of parallel processing Dask provides
  • Identify the major components of Dask: its collection types and its schedulers (see the sketch after this list)
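
To make these ideas concrete, here is a minimal sketch of Dask's deferred task parallelism using dask.delayed; the functions load and clean are hypothetical stand-ins for real work, and the scheduler choice is just one of several Dask offers:

```python
import dask
from dask import delayed

@delayed
def load(x):
    return x * 2          # stand-in for an expensive I/O step

@delayed
def clean(x):
    return x + 1          # stand-in for a transformation step

# Nothing runs yet: Dask only records the tasks and their dependencies.
results = [clean(load(i)) for i in range(4)]
total = delayed(sum)(results)

# compute() hands the task graph to a scheduler, which runs
# independent tasks in parallel; here we pick the threaded scheduler.
print(total.compute(scheduler="threads"))  # -> 16
```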

Part 2: Intro to Dask

  • Understand how graphs represent tasks with dependencies
  • Examine tasks in real time using the Dask dashboard
  • Assess the trade-offs between Dask's various data types (see the sketch after this list)
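
As a preview, the sketch below (intended for a notebook) builds a chunked Dask array whose task graph you can watch execute live on the dashboard; the array shape and chunk sizes are arbitrary choices for illustration:

```python
# Starting a local Client requires the dask.distributed package.
from dask.distributed import Client
import dask.array as da

client = Client()                 # spins up a local cluster
print(client.dashboard_link)      # open this URL to watch tasks execute

x = da.random.random((10_000, 10_000), chunks=(1_000, 1_000))
result = (x + x.T).mean()

# Each chunk operation becomes a node in the task graph; the transpose
# and the reductions are its dependency edges. compute() triggers
# execution, and the dashboard shows the tasks streaming by.
print(result.compute())
```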

Part 3: Dask + Pandas

  • Describe Distributed DataFrames
  • Identify common use cases for combining Dask and Pandas (see the sketch after this list)
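
To illustrate, here is a minimal sketch of the Pandas-like Dask DataFrame API; the file path and column names are hypothetical:

```python
import dask.dataframe as dd

# read_csv accepts a glob pattern, so one call can span many files,
# each becoming one or more partitions of the distributed DataFrame.
df = dd.read_csv("data/trips-*.csv")

# Familiar Pandas syntax, but evaluated lazily across partitions.
summary = df[df["fare"] > 0].groupby("pickup_zone")["fare"].mean()

# Only compute() triggers execution and returns a concrete Pandas object.
print(summary.compute())
```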