Ai+: Hands-On Parallel Computing with Dask and Pandas

Working on a single computer limits how much data you can process and how quickly you can process it. Many real-world datasets are larger than one machine can comfortably handle, so learning a parallel computing framework is increasingly necessary to stay productive. In this session, we will take a hands-on tour of the major components of Dask's graph computing engine and see how to leverage existing Pandas code to build scalable workflows.

If you use Python for data analysis in any capacity, want to work with datasets bigger than your computer can handle, and have been unsure where to start, this session is for you.

Most of the session will be spent writing code hands-on in Jupyter notebooks. Please review the environment setup guide ahead of time, and check back before the session for updated notebook commits!

Learning Objectives

Part 0: Introductions, Configuration, and Review

  • Configure our Python environments
  • Briefly review the fundamentals of Pandas

Part 1: Intro to Parallelism

  • Explain the core concepts of parallel computing
  • Become familiar with the types of parallel processing Dask provides
  • Identify the major components of Dask: its collection types and its schedulers (see the sketch after this list)
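
To make these ideas concrete, here is a minimal sketch of Dask's deferred task parallelism using dask.delayed; the functions load and clean are hypothetical stand-ins for real work, and the scheduler choice is just one of several Dask offers:

```python
import dask
from dask import delayed

@delayed
def load(x):
    return x * 2          # stand-in for an expensive I/O step

@delayed
def clean(x):
    return x + 1          # stand-in for a transformation step

# Nothing runs yet: Dask only records the tasks and their dependencies.
results = [clean(load(i)) for i in range(4)]
total = delayed(sum)(results)

# compute() hands the task graph to a scheduler, which runs
# independent tasks in parallel; here we pick the threaded scheduler.
print(total.compute(scheduler="threads"))  # -> 16
```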

Part 2: Intro to Dask

  • Understand how graphs represent tasks with dependencies
  • Examine tasks in real time using the Dask dashboard
  • Assess the trade-offs between Dask's various data types (see the sketch after this list)
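
As a preview, the sketch below (intended for a notebook) builds a chunked Dask array whose task graph you can watch execute live on the dashboard; the array shape and chunk sizes are arbitrary choices for illustration:

```python
# Starting a local Client requires the dask.distributed package.
from dask.distributed import Client
import dask.array as da

client = Client()                 # spins up a local cluster
print(client.dashboard_link)      # open this URL to watch tasks execute

x = da.random.random((10_000, 10_000), chunks=(1_000, 1_000))
result = (x + x.T).mean()

# Each chunk operation becomes a node in the task graph; the transpose
# and the reductions are its dependency edges. compute() triggers
# execution, and the dashboard shows the tasks streaming by.
print(result.compute())
```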

Part 3: Dask + Pandas

  • Describe Distributed DataFrames
  • Identify common use cases for combining Dask and Pandas (see the sketch after this list)
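
To illustrate, here is a minimal sketch of the Pandas-like Dask DataFrame API; the file path and column names are hypothetical:

```python
import dask.dataframe as dd

# read_csv accepts a glob pattern, so one call can span many files,
# each becoming one or more partitions of the distributed DataFrame.
df = dd.read_csv("data/trips-*.csv")

# Familiar Pandas syntax, but evaluated lazily across partitions.
summary = df[df["fare"] > 0].groupby("pickup_zone")["fare"].mean()

# Only compute() triggers execution and returns a concrete Pandas object.
print(summary.compute())
```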