Dask is a native Python library for parallel computing. This tutorial shows how to scale data science workloads from a laptop to a cluster using Dask.
Create the conda environment and launch JupyterLab (or Jupyter Notebook):
conda env create -f environment.yml
conda activate dask-speed
jupyter lab
- Get data: Pulls files from S3
- Laptop: Run analysis and train models using non-parallel Python packages. Then try to load a larger dataset and run out of memory.
- Dask laptop: Run the same analysis on the larger dataset using Dask, still on the laptop. Slow, but it executes.
- Dask cluster: Run analysis with a Dask cluster, super fast!
To run in Saturn Cloud, create a new Project with the following settings:
Then launch Jupyter, open a new terminal window, and clone the repo:
git clone https://github.com/rikturr/getting-up-to-speed-with-dask