Dask provides multi-core execution on larger-than-memory datasets.
We can think of Dask at two levels, high and low:
- High-level collections: Dask provides high-level Array, Bag, and DataFrame collections that mimic NumPy, lists, and Pandas but can operate in parallel on datasets that don't fit into main memory. These collections are alternatives to NumPy and Pandas for large datasets (see the first sketch after this list).
- Low-level schedulers: Dask provides dynamic task schedulers that execute task graphs in parallel. These execution engines power the high-level collections mentioned above but can also power custom, user-defined workloads. These schedulers are low-latency (around 1ms) and work hard to run computations in a small memory footprint. Dask's schedulers are an alternative to direct use of the threading or multiprocessing libraries in complex cases, or to other task scheduling systems like Luigi or IPython parallel (see the second sketch after this list).
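
As a taste of the high-level side, here is a minimal sketch using dask.array; the array shape and chunk sizes below are arbitrary illustrative choices, not taken from the tutorial notebooks.

```python
# A minimal sketch of the high-level collections: dask.array mimics the
# NumPy interface but splits work into chunks that can run on many cores.
import dask.array as da

# A 10000x10000 array of random numbers, stored as 1000x1000 chunks
x = da.random.random((10000, 10000), chunks=(1000, 1000))

# Operations look like NumPy but only build a lazy task graph...
y = (x + x.T).mean(axis=0)

# ...which executes in parallel when we ask for a concrete result
print(y.compute())
```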
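And a minimal sketch of the low-level side: a dask graph is an ordinary dict mapping keys to data or to `(function, argument, ...)` task tuples, and a scheduler such as `dask.threaded.get` executes it. The toy graph below is illustrative only.

```python
# A minimal sketch of a hand-written dask graph: a plain dict whose
# values are either data or (function, arg, ...) task tuples.
from operator import add, mul
from dask.threaded import get  # the multi-threaded scheduler

dsk = {'a': 1,
       'b': 2,
       'x': (add, 'a', 'b'),   # x = a + b = 3
       'y': (mul, 'x', 10)}    # y = x * 10 = 30

# The scheduler walks the dependencies and runs independent tasks in parallel
print(get(dsk, 'y'))  # prints 30
```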
Different users operate at different levels, but it is useful to understand both. This tutorial alternates between high-level use of dask.array and dask.dataframe (even sections) and low-level use of dask graphs and schedulers (odd sections).
You will need the following core libraries:

```
conda install numpy pandas h5py Pillow matplotlib scipy toolz pytables
```
and a recently updated version of Dask:

```
conda install dask
```

or, with pip:

```
pip install dask
```
You may find the following libraries helpful for some exercises:

```
pip install castra graphviz
```
You should clone this repository:

```
git clone http://github.com/dask/dask-tutorial
```
and then run this script to prepare the artificial data:

```
cd dask-tutorial
python prep.py
```
- Reference
- Ask for help
    - `dask` tag on Stack Overflow
    - GitHub issues for bug reports and feature requests
    - blaze-dev mailing list for community discussion
    - Please ask questions during a live tutorial
1. Introduction - slides
2. Arrays - slides
3. Task graphs and other fundamentals - slides
4. DataFrames
5. Imperative Programming
6. Bags of semi-structured data
7. Homework - large datasets with which to play at home
8. End - slides